2023-06-23 17:26:52,791 INFO [train.py:1064] (2/4) Training started
2023-06-23 17:26:52,791 INFO [train.py:1074] (2/4) Device: cuda:2
2023-06-23 17:26:55,745 INFO [lexicon.py:168] (2/4) Loading pre-compiled data/lang_char/Linv.pt
2023-06-23 17:26:56,361 INFO [train.py:1085] (2/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.1', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'c51a0b9684442a88ee37f3ce0af686a04b66855b', 'k2-git-date': 'Mon May 1 21:38:03 2023', 'lhotse-version': '1.14.0.dev+git.0f812851.dirty', 'torch-version': '1.10.0+cu102', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'zipformer_wenetspeech', 'icefall-git-sha1': '63e53ba-dirty', 'icefall-git-date': 'Wed Jun 21 18:13:24 2023', 'icefall-path': '/star-kw/kangwei/code/icefall_wenetspeech', 'k2-path': '/ceph-hw/kangwei/code/k2_release/k2/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-hw/kangwei/dev_tools/anaconda3/envs/rnnt2/lib/python3.8/site-packages/lhotse-1.14.0.dev0+git.0f812851.dirty-py3.8.egg/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-6-0423201309-7c68fd68fb-6cszs', 'IP address': '10.177.28.83'}, 'world_size': 4, 'master_port': 12536, 'tensorboard': True, 'num_epochs': 12, 'start_epoch': 6, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp_L_small'), 'lang_dir': PosixPath('data/lang_char'), 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 1.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,2,2,2,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,768,768,768,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,256,256,256,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,192,192,192,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 900, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 8, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'training_subset': 'L', 'blank_id': 0, 'vocab_size': 5537}
2023-06-23 17:26:56,362 INFO [train.py:1087] (2/4) About to create model
2023-06-23 17:26:57,104 INFO [train.py:1091] (2/4) Number of model parameters: 32327030
2023-06-23 17:26:57,106 INFO [checkpoint.py:112] (2/4) Loading checkpoint from zipformer/exp_L_small/epoch-5.pt
2023-06-23 17:27:09,299 INFO [train.py:1106] (2/4) Using DDP
2023-06-23 17:27:09,622 INFO [train.py:1118] (2/4) Loading optimizer state dict
2023-06-23 17:27:10,137 INFO [train.py:1126] (2/4) Loading scheduler state dict
2023-06-23 17:27:10,137 INFO [asr_datamodule.py:390] (2/4) About to get train cuts
2023-06-23 17:27:10,140 INFO [asr_datamodule.py:398] (2/4) About to get dev cuts
2023-06-23 17:27:10,142 INFO [asr_datamodule.py:211] (2/4) About to get Musan cuts
2023-06-23 17:27:13,438 INFO [asr_datamodule.py:216] (2/4) Enable MUSAN
2023-06-23 17:27:13,438 INFO [asr_datamodule.py:239] (2/4) Enable SpecAugment
2023-06-23 17:27:13,438 INFO [asr_datamodule.py:240] (2/4) Time warp factor: 80
2023-06-23 17:27:13,439 INFO [asr_datamodule.py:250] (2/4) Num frame mask: 10
2023-06-23 17:27:13,439 INFO [asr_datamodule.py:263] (2/4) About to create train dataset
2023-06-23 17:27:13,439 INFO [asr_datamodule.py:289] (2/4) Using DynamicBucketingSampler.
2023-06-23 17:27:18,980 INFO [asr_datamodule.py:305] (2/4) About to create train dataloader
2023-06-23 17:27:18,982 INFO [asr_datamodule.py:336] (2/4) About to create dev dataset
2023-06-23 17:27:19,914 INFO [asr_datamodule.py:354] (2/4) About to create dev dataloader
2023-06-23 17:27:19,914 INFO [train.py:1206] (2/4) Loading grad scaler state dict
2023-06-23 17:29:33,359 INFO [train.py:996] (2/4) Epoch 6, batch 0, loss[loss=0.2199, simple_loss=0.2965, pruned_loss=0.07164, over 21735.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2965, pruned_loss=0.07164, over 21735.00 frames. ], batch size: 124, lr: 5.35e-03, grad_scale: 32.0
2023-06-23 17:29:33,360 INFO [train.py:1019] (2/4) Computing validation loss
2023-06-23 17:29:50,963 INFO [train.py:1028] (2/4) Epoch 6, validation: loss=0.2383, simple_loss=0.345, pruned_loss=0.06586, over 1796401.00 frames.
2023-06-23 17:29:50,964 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 21998MB
2023-06-23 17:29:55,986 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0
2023-06-23 17:30:00,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=914838.0, ans=0.0
2023-06-23 17:30:28,635 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.741e+02 4.794e+02 6.251e+02 8.348e+02 2.118e+03, threshold=1.250e+03, percent-clipped=42.0
2023-06-23 17:30:35,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=914958.0, ans=0.1
2023-06-23 17:31:11,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=915018.0, ans=0.0
2023-06-23 17:31:12,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=915018.0, ans=0.1
2023-06-23 17:31:25,441 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. limit=6.0
2023-06-23 17:31:35,956 INFO [train.py:996] (2/4) Epoch 6, batch 50, loss[loss=0.203, simple_loss=0.2894, pruned_loss=0.05834, over 21364.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3196, pruned_loss=0.08148, over 964838.42 frames.
], batch size: 194, lr: 5.35e-03, grad_scale: 16.0 2023-06-23 17:31:46,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=915138.0, ans=0.0 2023-06-23 17:31:51,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=915198.0, ans=0.125 2023-06-23 17:32:15,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=915258.0, ans=0.125 2023-06-23 17:33:21,843 INFO [train.py:996] (2/4) Epoch 6, batch 100, loss[loss=0.2501, simple_loss=0.3429, pruned_loss=0.07867, over 21748.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3321, pruned_loss=0.08377, over 1688946.96 frames. ], batch size: 332, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:33:42,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=915498.0, ans=15.0 2023-06-23 17:33:43,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=915498.0, ans=0.125 2023-06-23 17:34:04,539 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.886e+02 2.333e+02 2.600e+02 2.995e+02 4.991e+02, threshold=5.199e+02, percent-clipped=0.0 2023-06-23 17:34:18,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=915558.0, ans=0.125 2023-06-23 17:34:52,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=915678.0, ans=0.1 2023-06-23 17:35:09,861 INFO [train.py:996] (2/4) Epoch 6, batch 150, loss[loss=0.2464, simple_loss=0.3445, pruned_loss=0.07408, over 21798.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3315, pruned_loss=0.0818, over 2256278.63 frames. ], batch size: 332, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:35:26,231 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:36:43,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=915978.0, ans=0.0 2023-06-23 17:36:59,851 INFO [train.py:996] (2/4) Epoch 6, batch 200, loss[loss=0.2854, simple_loss=0.3516, pruned_loss=0.1096, over 21798.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3296, pruned_loss=0.08251, over 2702145.21 frames. ], batch size: 441, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:37:02,966 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.52 vs. 
limit=12.0 2023-06-23 17:37:11,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=916038.0, ans=0.0 2023-06-23 17:37:11,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=916038.0, ans=0.125 2023-06-23 17:37:25,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=916098.0, ans=0.125 2023-06-23 17:37:40,267 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.585e+02 2.985e+02 3.639e+02 6.609e+02, threshold=5.970e+02, percent-clipped=4.0 2023-06-23 17:37:46,970 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.58 vs. limit=6.0 2023-06-23 17:38:34,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=916278.0, ans=0.125 2023-06-23 17:38:47,087 INFO [train.py:996] (2/4) Epoch 6, batch 250, loss[loss=0.2372, simple_loss=0.3152, pruned_loss=0.07962, over 21515.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3242, pruned_loss=0.07996, over 3046810.57 frames. ], batch size: 131, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:39:22,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=916398.0, ans=0.125 2023-06-23 17:40:01,594 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=18.47 vs. limit=22.5 2023-06-23 17:40:07,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=916518.0, ans=0.125 2023-06-23 17:40:28,679 INFO [train.py:996] (2/4) Epoch 6, batch 300, loss[loss=0.2323, simple_loss=0.3052, pruned_loss=0.07972, over 21600.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3184, pruned_loss=0.07908, over 3315569.69 frames. ], batch size: 263, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:40:40,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=916638.0, ans=0.015 2023-06-23 17:40:57,673 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-23 17:41:08,948 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 2.631e+02 3.060e+02 3.627e+02 5.054e+02, threshold=6.120e+02, percent-clipped=0.0 2023-06-23 17:41:18,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=916758.0, ans=0.0 2023-06-23 17:42:21,708 INFO [train.py:996] (2/4) Epoch 6, batch 350, loss[loss=0.1991, simple_loss=0.2661, pruned_loss=0.06602, over 21833.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3124, pruned_loss=0.0791, over 3524855.47 frames. 
], batch size: 352, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:42:30,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=916938.0, ans=0.2 2023-06-23 17:43:02,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=917058.0, ans=0.125 2023-06-23 17:43:26,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=917058.0, ans=0.125 2023-06-23 17:43:35,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=917118.0, ans=0.125 2023-06-23 17:43:42,767 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.43 vs. limit=15.0 2023-06-23 17:43:49,566 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=15.0 2023-06-23 17:43:54,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=917178.0, ans=0.015 2023-06-23 17:44:07,646 INFO [train.py:996] (2/4) Epoch 6, batch 400, loss[loss=0.2005, simple_loss=0.2695, pruned_loss=0.06571, over 21635.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.306, pruned_loss=0.07774, over 3687412.62 frames. ], batch size: 298, lr: 5.34e-03, grad_scale: 32.0 2023-06-23 17:44:37,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=917298.0, ans=0.125 2023-06-23 17:44:47,061 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.687e+02 2.996e+02 3.462e+02 5.169e+02, threshold=5.992e+02, percent-clipped=0.0 2023-06-23 17:44:52,181 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.83 vs. limit=10.0 2023-06-23 17:45:55,354 INFO [train.py:996] (2/4) Epoch 6, batch 450, loss[loss=0.2247, simple_loss=0.2806, pruned_loss=0.08443, over 21673.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3028, pruned_loss=0.07681, over 3823011.46 frames. ], batch size: 417, lr: 5.34e-03, grad_scale: 32.0 2023-06-23 17:46:01,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=917538.0, ans=0.0 2023-06-23 17:46:08,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=917538.0, ans=0.125 2023-06-23 17:46:31,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=917598.0, ans=0.0 2023-06-23 17:47:23,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=917718.0, ans=0.125 2023-06-23 17:47:38,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=917778.0, ans=0.0 2023-06-23 17:47:46,332 INFO [train.py:996] (2/4) Epoch 6, batch 500, loss[loss=0.2383, simple_loss=0.353, pruned_loss=0.0618, over 19816.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3048, pruned_loss=0.0757, over 3929782.34 frames. 
], batch size: 703, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:47:49,377 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=22.5 2023-06-23 17:48:32,683 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.519e+02 2.896e+02 3.744e+02 5.708e+02, threshold=5.793e+02, percent-clipped=0.0 2023-06-23 17:48:47,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=917958.0, ans=0.125 2023-06-23 17:49:13,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=918078.0, ans=0.0 2023-06-23 17:49:30,898 INFO [train.py:996] (2/4) Epoch 6, batch 550, loss[loss=0.2992, simple_loss=0.3972, pruned_loss=0.1006, over 21542.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3104, pruned_loss=0.07653, over 4005236.73 frames. ], batch size: 471, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:51:15,199 INFO [train.py:996] (2/4) Epoch 6, batch 600, loss[loss=0.2158, simple_loss=0.2776, pruned_loss=0.07701, over 21997.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3119, pruned_loss=0.07654, over 4070114.11 frames. ], batch size: 103, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:52:12,359 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.806e+02 2.707e+02 3.073e+02 3.854e+02 5.945e+02, threshold=6.147e+02, percent-clipped=1.0 2023-06-23 17:52:38,038 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=15.0 2023-06-23 17:53:04,169 INFO [train.py:996] (2/4) Epoch 6, batch 650, loss[loss=0.2116, simple_loss=0.2759, pruned_loss=0.07369, over 21854.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.313, pruned_loss=0.07736, over 4124592.28 frames. ], batch size: 107, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:53:33,517 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.92 vs. limit=15.0 2023-06-23 17:54:20,058 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.00 vs. limit=6.0 2023-06-23 17:54:46,846 INFO [train.py:996] (2/4) Epoch 6, batch 700, loss[loss=0.2428, simple_loss=0.3229, pruned_loss=0.08134, over 21797.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.312, pruned_loss=0.07744, over 4162035.06 frames. ], batch size: 107, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 17:55:38,549 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 2.507e+02 2.938e+02 3.548e+02 4.696e+02, threshold=5.875e+02, percent-clipped=0.0 2023-06-23 17:56:07,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=919218.0, ans=0.07 2023-06-23 17:56:35,822 INFO [train.py:996] (2/4) Epoch 6, batch 750, loss[loss=0.2377, simple_loss=0.2933, pruned_loss=0.09103, over 21985.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3151, pruned_loss=0.07905, over 4191219.85 frames. ], batch size: 103, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 17:56:43,614 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.48 vs. 
limit=12.0 2023-06-23 17:56:55,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=919398.0, ans=0.1 2023-06-23 17:57:31,116 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.21 vs. limit=15.0 2023-06-23 17:57:50,016 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-23 17:57:52,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=919518.0, ans=0.125 2023-06-23 17:58:06,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=919578.0, ans=0.1 2023-06-23 17:58:24,801 INFO [train.py:996] (2/4) Epoch 6, batch 800, loss[loss=0.2013, simple_loss=0.2781, pruned_loss=0.06228, over 21682.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3102, pruned_loss=0.079, over 4210684.00 frames. ], batch size: 263, lr: 5.33e-03, grad_scale: 32.0 2023-06-23 17:58:36,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=919638.0, ans=0.2 2023-06-23 17:58:40,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=919698.0, ans=0.025 2023-06-23 17:58:51,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=919698.0, ans=0.2 2023-06-23 17:59:04,534 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 2.561e+02 2.955e+02 3.550e+02 6.098e+02, threshold=5.911e+02, percent-clipped=2.0 2023-06-23 17:59:38,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=919818.0, ans=0.125 2023-06-23 18:00:08,606 INFO [train.py:996] (2/4) Epoch 6, batch 850, loss[loss=0.2454, simple_loss=0.3691, pruned_loss=0.06082, over 19723.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3077, pruned_loss=0.0789, over 4230507.18 frames. ], batch size: 703, lr: 5.33e-03, grad_scale: 32.0 2023-06-23 18:00:11,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=919938.0, ans=0.125 2023-06-23 18:00:16,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=919938.0, ans=0.1 2023-06-23 18:00:37,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=919998.0, ans=0.125 2023-06-23 18:01:14,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=920058.0, ans=0.0 2023-06-23 18:01:19,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=920058.0, ans=0.025 2023-06-23 18:01:59,417 INFO [train.py:996] (2/4) Epoch 6, batch 900, loss[loss=0.2183, simple_loss=0.2842, pruned_loss=0.07621, over 21471.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3046, pruned_loss=0.07798, over 4248176.28 frames. 
], batch size: 194, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 18:02:53,507 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.973e+02 2.549e+02 3.030e+02 3.332e+02 5.799e+02, threshold=6.061e+02, percent-clipped=0.0 2023-06-23 18:03:24,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=920418.0, ans=0.125 2023-06-23 18:03:42,650 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=15.0 2023-06-23 18:03:50,047 INFO [train.py:996] (2/4) Epoch 6, batch 950, loss[loss=0.2358, simple_loss=0.3112, pruned_loss=0.08023, over 21877.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3011, pruned_loss=0.07709, over 4260853.64 frames. ], batch size: 107, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 18:04:22,931 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.53 vs. limit=10.0 2023-06-23 18:04:32,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=920598.0, ans=0.1 2023-06-23 18:04:34,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=920598.0, ans=0.025 2023-06-23 18:05:11,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=920718.0, ans=0.0 2023-06-23 18:05:41,248 INFO [train.py:996] (2/4) Epoch 6, batch 1000, loss[loss=0.2398, simple_loss=0.3317, pruned_loss=0.07391, over 21628.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2997, pruned_loss=0.07695, over 4270078.92 frames. ], batch size: 414, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 18:06:08,811 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-23 18:06:25,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=920898.0, ans=0.04949747468305833 2023-06-23 18:06:42,107 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.855e+02 2.583e+02 2.913e+02 3.407e+02 5.854e+02, threshold=5.827e+02, percent-clipped=0.0 2023-06-23 18:07:12,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=921018.0, ans=0.0 2023-06-23 18:07:30,516 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.92 vs. limit=15.0 2023-06-23 18:07:32,574 INFO [train.py:996] (2/4) Epoch 6, batch 1050, loss[loss=0.2033, simple_loss=0.2905, pruned_loss=0.05805, over 21742.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3028, pruned_loss=0.07812, over 4279929.74 frames. 
], batch size: 247, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 18:08:42,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=921258.0, ans=0.0 2023-06-23 18:08:51,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=921318.0, ans=0.125 2023-06-23 18:09:28,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=921378.0, ans=0.125 2023-06-23 18:09:30,062 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:09:31,287 INFO [train.py:996] (2/4) Epoch 6, batch 1100, loss[loss=0.2287, simple_loss=0.3104, pruned_loss=0.07348, over 21531.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3043, pruned_loss=0.07819, over 4280617.68 frames. ], batch size: 471, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 18:09:36,391 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.27 vs. limit=12.0 2023-06-23 18:09:57,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=921438.0, ans=0.125 2023-06-23 18:10:25,289 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.974e+02 2.670e+02 3.079e+02 4.028e+02 7.418e+02, threshold=6.158e+02, percent-clipped=6.0 2023-06-23 18:11:29,977 INFO [train.py:996] (2/4) Epoch 6, batch 1150, loss[loss=0.226, simple_loss=0.2996, pruned_loss=0.07619, over 21301.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3027, pruned_loss=0.07645, over 4285331.17 frames. ], batch size: 143, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 18:12:57,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=921978.0, ans=0.05 2023-06-23 18:13:10,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=921978.0, ans=0.1 2023-06-23 18:13:16,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=922038.0, ans=0.125 2023-06-23 18:13:17,166 INFO [train.py:996] (2/4) Epoch 6, batch 1200, loss[loss=0.2481, simple_loss=0.347, pruned_loss=0.07459, over 21649.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3033, pruned_loss=0.07666, over 4280157.64 frames. ], batch size: 414, lr: 5.33e-03, grad_scale: 32.0 2023-06-23 18:13:20,460 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.30 vs. 
limit=15.0 2023-06-23 18:13:31,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=922038.0, ans=0.125 2023-06-23 18:13:33,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=922038.0, ans=0.125 2023-06-23 18:13:45,890 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:13:59,212 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.616e+02 3.018e+02 3.638e+02 5.698e+02, threshold=6.035e+02, percent-clipped=0.0 2023-06-23 18:14:42,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=922278.0, ans=0.125 2023-06-23 18:15:07,513 INFO [train.py:996] (2/4) Epoch 6, batch 1250, loss[loss=0.2276, simple_loss=0.3111, pruned_loss=0.072, over 21677.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3055, pruned_loss=0.07834, over 4287135.53 frames. ], batch size: 389, lr: 5.32e-03, grad_scale: 32.0 2023-06-23 18:15:42,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=922398.0, ans=0.0 2023-06-23 18:15:51,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=922458.0, ans=0.125 2023-06-23 18:15:55,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=922458.0, ans=0.125 2023-06-23 18:16:59,820 INFO [train.py:996] (2/4) Epoch 6, batch 1300, loss[loss=0.2339, simple_loss=0.3015, pruned_loss=0.08314, over 21359.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3063, pruned_loss=0.07864, over 4293375.22 frames. 
], batch size: 159, lr: 5.32e-03, grad_scale: 32.0 2023-06-23 18:17:11,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=922638.0, ans=0.125 2023-06-23 18:17:42,856 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 2.762e+02 3.245e+02 4.001e+02 7.520e+02, threshold=6.490e+02, percent-clipped=2.0 2023-06-23 18:17:48,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=922758.0, ans=0.0 2023-06-23 18:17:49,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=922758.0, ans=0.04949747468305833 2023-06-23 18:17:54,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=922758.0, ans=0.125 2023-06-23 18:17:56,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=922758.0, ans=0.1 2023-06-23 18:18:25,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=922818.0, ans=0.5 2023-06-23 18:18:29,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=922878.0, ans=0.125 2023-06-23 18:18:30,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=922878.0, ans=0.1 2023-06-23 18:18:42,272 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.92 vs. limit=10.0 2023-06-23 18:18:46,523 INFO [train.py:996] (2/4) Epoch 6, batch 1350, loss[loss=0.2287, simple_loss=0.2883, pruned_loss=0.08456, over 21327.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3068, pruned_loss=0.07941, over 4291585.93 frames. ], batch size: 159, lr: 5.32e-03, grad_scale: 32.0 2023-06-23 18:18:59,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=922938.0, ans=0.2 2023-06-23 18:19:12,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=922998.0, ans=0.0 2023-06-23 18:19:46,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=923058.0, ans=0.125 2023-06-23 18:20:06,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=923118.0, ans=0.05 2023-06-23 18:20:26,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=923178.0, ans=0.015 2023-06-23 18:20:36,687 INFO [train.py:996] (2/4) Epoch 6, batch 1400, loss[loss=0.2411, simple_loss=0.3113, pruned_loss=0.08548, over 21315.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3052, pruned_loss=0.07891, over 4293678.24 frames. 
], batch size: 159, lr: 5.32e-03, grad_scale: 32.0 2023-06-23 18:20:49,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=923238.0, ans=0.1 2023-06-23 18:20:53,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=923238.0, ans=0.1 2023-06-23 18:20:58,740 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:21:16,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=923358.0, ans=0.125 2023-06-23 18:21:20,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=923358.0, ans=0.125 2023-06-23 18:21:21,173 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 2.458e+02 2.680e+02 3.185e+02 5.161e+02, threshold=5.361e+02, percent-clipped=0.0 2023-06-23 18:22:28,241 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.85 vs. limit=15.0 2023-06-23 18:22:35,797 INFO [train.py:996] (2/4) Epoch 6, batch 1450, loss[loss=0.2327, simple_loss=0.3062, pruned_loss=0.0796, over 21384.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3063, pruned_loss=0.0796, over 4290911.07 frames. ], batch size: 549, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:22:37,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=923538.0, ans=0.125 2023-06-23 18:23:10,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=923658.0, ans=0.0 2023-06-23 18:23:15,341 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:23:28,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=923658.0, ans=0.125 2023-06-23 18:23:29,206 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=6.0 2023-06-23 18:24:05,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=923778.0, ans=0.1 2023-06-23 18:24:20,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=923838.0, ans=0.1 2023-06-23 18:24:26,510 INFO [train.py:996] (2/4) Epoch 6, batch 1500, loss[loss=0.2354, simple_loss=0.2985, pruned_loss=0.08617, over 21615.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3091, pruned_loss=0.08134, over 4295979.45 frames. 
], batch size: 548, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:24:38,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=923838.0, ans=0.125 2023-06-23 18:25:05,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=923958.0, ans=0.2 2023-06-23 18:25:06,058 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 2.614e+02 2.900e+02 3.425e+02 5.180e+02, threshold=5.801e+02, percent-clipped=0.0 2023-06-23 18:25:11,154 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.32 vs. limit=15.0 2023-06-23 18:25:42,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=924018.0, ans=0.5 2023-06-23 18:26:20,379 INFO [train.py:996] (2/4) Epoch 6, batch 1550, loss[loss=0.2212, simple_loss=0.2943, pruned_loss=0.07405, over 21866.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3051, pruned_loss=0.07858, over 4301294.13 frames. ], batch size: 351, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:26:24,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=924138.0, ans=0.125 2023-06-23 18:27:22,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=924258.0, ans=0.125 2023-06-23 18:28:14,018 INFO [train.py:996] (2/4) Epoch 6, batch 1600, loss[loss=0.2206, simple_loss=0.2962, pruned_loss=0.07252, over 20039.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3029, pruned_loss=0.07818, over 4299322.92 frames. ], batch size: 702, lr: 5.32e-03, grad_scale: 32.0 2023-06-23 18:28:33,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=924498.0, ans=0.0 2023-06-23 18:29:08,555 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.611e+02 2.907e+02 3.387e+02 5.572e+02, threshold=5.813e+02, percent-clipped=0.0 2023-06-23 18:29:52,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=924678.0, ans=0.0 2023-06-23 18:30:08,420 INFO [train.py:996] (2/4) Epoch 6, batch 1650, loss[loss=0.2806, simple_loss=0.3324, pruned_loss=0.1144, over 21607.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3025, pruned_loss=0.07838, over 4287659.94 frames. ], batch size: 471, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:30:37,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=924798.0, ans=0.125 2023-06-23 18:30:44,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=924798.0, ans=0.1 2023-06-23 18:32:02,900 INFO [train.py:996] (2/4) Epoch 6, batch 1700, loss[loss=0.1878, simple_loss=0.2828, pruned_loss=0.04643, over 21647.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3043, pruned_loss=0.07897, over 4286813.99 frames. 
], batch size: 263, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:33:01,017 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 2.590e+02 2.907e+02 3.447e+02 5.734e+02, threshold=5.814e+02, percent-clipped=0.0 2023-06-23 18:33:29,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=925218.0, ans=0.1 2023-06-23 18:34:02,126 INFO [train.py:996] (2/4) Epoch 6, batch 1750, loss[loss=0.1509, simple_loss=0.2198, pruned_loss=0.04098, over 21377.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3044, pruned_loss=0.07673, over 4288823.31 frames. ], batch size: 131, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:35:11,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=925458.0, ans=0.1 2023-06-23 18:35:20,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=925518.0, ans=0.09899494936611666 2023-06-23 18:35:50,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=925578.0, ans=0.0 2023-06-23 18:36:02,464 INFO [train.py:996] (2/4) Epoch 6, batch 1800, loss[loss=0.2172, simple_loss=0.2956, pruned_loss=0.06945, over 21737.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3025, pruned_loss=0.07429, over 4292322.62 frames. ], batch size: 298, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:36:28,729 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:36:54,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=925758.0, ans=0.125 2023-06-23 18:36:56,012 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.930e+02 2.395e+02 2.914e+02 3.634e+02 6.423e+02, threshold=5.828e+02, percent-clipped=1.0 2023-06-23 18:37:01,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=925758.0, ans=0.125 2023-06-23 18:37:12,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=925818.0, ans=0.0 2023-06-23 18:37:34,957 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.28 vs. limit=22.5 2023-06-23 18:37:40,052 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.04 vs. limit=12.0 2023-06-23 18:37:52,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=925938.0, ans=0.125 2023-06-23 18:37:53,480 INFO [train.py:996] (2/4) Epoch 6, batch 1850, loss[loss=0.2177, simple_loss=0.3132, pruned_loss=0.06105, over 21753.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3053, pruned_loss=0.07365, over 4293696.66 frames. ], batch size: 351, lr: 5.31e-03, grad_scale: 8.0 2023-06-23 18:38:01,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=925938.0, ans=0.0 2023-06-23 18:39:46,169 INFO [train.py:996] (2/4) Epoch 6, batch 1900, loss[loss=0.2417, simple_loss=0.3153, pruned_loss=0.08406, over 21193.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3037, pruned_loss=0.07345, over 4293664.90 frames. 
], batch size: 548, lr: 5.31e-03, grad_scale: 8.0 2023-06-23 18:40:10,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=926298.0, ans=0.125 2023-06-23 18:40:39,858 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.383e+02 2.644e+02 3.253e+02 4.924e+02, threshold=5.288e+02, percent-clipped=0.0 2023-06-23 18:41:07,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=926418.0, ans=0.0 2023-06-23 18:41:37,828 INFO [train.py:996] (2/4) Epoch 6, batch 1950, loss[loss=0.2137, simple_loss=0.3134, pruned_loss=0.05698, over 21698.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3009, pruned_loss=0.07239, over 4291527.35 frames. ], batch size: 298, lr: 5.31e-03, grad_scale: 8.0 2023-06-23 18:41:47,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=926538.0, ans=0.0 2023-06-23 18:42:20,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=926598.0, ans=0.1 2023-06-23 18:42:40,236 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.66 vs. limit=10.0 2023-06-23 18:43:25,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=926778.0, ans=0.0 2023-06-23 18:43:37,088 INFO [train.py:996] (2/4) Epoch 6, batch 2000, loss[loss=0.1845, simple_loss=0.2602, pruned_loss=0.0544, over 21294.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2972, pruned_loss=0.07046, over 4293619.36 frames. ], batch size: 159, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:43:39,683 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:43:49,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=926838.0, ans=0.1 2023-06-23 18:44:14,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=926898.0, ans=0.1 2023-06-23 18:44:23,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=926958.0, ans=0.2 2023-06-23 18:44:24,631 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.849e+02 2.599e+02 2.979e+02 3.641e+02 7.240e+02, threshold=5.958e+02, percent-clipped=3.0 2023-06-23 18:44:28,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=926958.0, ans=0.125 2023-06-23 18:44:46,328 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=15.0 2023-06-23 18:45:02,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=927078.0, ans=0.125 2023-06-23 18:45:28,375 INFO [train.py:996] (2/4) Epoch 6, batch 2050, loss[loss=0.1788, simple_loss=0.2499, pruned_loss=0.05386, over 21530.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2972, pruned_loss=0.07077, over 4292253.80 frames. 
], batch size: 195, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:45:39,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=927138.0, ans=0.125 2023-06-23 18:45:57,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=927198.0, ans=0.0 2023-06-23 18:45:58,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=927198.0, ans=0.0 2023-06-23 18:46:08,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=927258.0, ans=0.04949747468305833 2023-06-23 18:46:17,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=927258.0, ans=0.95 2023-06-23 18:47:18,338 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2023-06-23 18:47:20,411 INFO [train.py:996] (2/4) Epoch 6, batch 2100, loss[loss=0.2163, simple_loss=0.2971, pruned_loss=0.06774, over 21657.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3, pruned_loss=0.07298, over 4289683.09 frames. ], batch size: 263, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:48:05,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=927558.0, ans=0.1 2023-06-23 18:48:07,860 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=12.0 2023-06-23 18:48:08,467 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 2.503e+02 2.741e+02 3.125e+02 4.918e+02, threshold=5.483e+02, percent-clipped=0.0 2023-06-23 18:48:09,784 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=12.0 2023-06-23 18:48:35,571 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.27 vs. limit=10.0 2023-06-23 18:49:12,102 INFO [train.py:996] (2/4) Epoch 6, batch 2150, loss[loss=0.2377, simple_loss=0.3194, pruned_loss=0.07803, over 21374.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3016, pruned_loss=0.0752, over 4287645.47 frames. ], batch size: 211, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:49:16,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=927738.0, ans=0.125 2023-06-23 18:49:33,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=927738.0, ans=0.2 2023-06-23 18:49:34,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=927798.0, ans=0.1 2023-06-23 18:50:40,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=927978.0, ans=0.125 2023-06-23 18:51:00,006 INFO [train.py:996] (2/4) Epoch 6, batch 2200, loss[loss=0.2139, simple_loss=0.2701, pruned_loss=0.07881, over 21291.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3034, pruned_loss=0.07627, over 4281163.79 frames. 
], batch size: 608, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:51:06,286 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.45 vs. limit=15.0 2023-06-23 18:51:20,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=928038.0, ans=0.0 2023-06-23 18:51:47,933 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 2.632e+02 2.959e+02 3.421e+02 5.687e+02, threshold=5.917e+02, percent-clipped=1.0 2023-06-23 18:52:49,701 INFO [train.py:996] (2/4) Epoch 6, batch 2250, loss[loss=0.1869, simple_loss=0.2514, pruned_loss=0.06118, over 21764.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3008, pruned_loss=0.07466, over 4277261.15 frames. ], batch size: 112, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:53:17,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=928398.0, ans=0.125 2023-06-23 18:53:25,982 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.66 vs. limit=22.5 2023-06-23 18:54:34,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=928578.0, ans=0.125 2023-06-23 18:54:40,537 INFO [train.py:996] (2/4) Epoch 6, batch 2300, loss[loss=0.2202, simple_loss=0.2876, pruned_loss=0.0764, over 22019.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2978, pruned_loss=0.07475, over 4279861.38 frames. ], batch size: 103, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:55:22,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=928758.0, ans=0.125 2023-06-23 18:55:28,458 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.991e+02 2.420e+02 2.816e+02 3.301e+02 5.962e+02, threshold=5.633e+02, percent-clipped=1.0 2023-06-23 18:55:59,820 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.73 vs. limit=15.0 2023-06-23 18:56:38,344 INFO [train.py:996] (2/4) Epoch 6, batch 2350, loss[loss=0.2217, simple_loss=0.3034, pruned_loss=0.06999, over 20692.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2955, pruned_loss=0.07466, over 4264802.64 frames. ], batch size: 607, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:56:45,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=928938.0, ans=0.2 2023-06-23 18:56:58,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=928998.0, ans=0.1 2023-06-23 18:57:14,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=929058.0, ans=0.05 2023-06-23 18:58:12,076 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.34 vs. 
limit=22.5 2023-06-23 18:58:13,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=929178.0, ans=0.1 2023-06-23 18:58:18,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=929178.0, ans=0.125 2023-06-23 18:58:30,904 INFO [train.py:996] (2/4) Epoch 6, batch 2400, loss[loss=0.1946, simple_loss=0.2577, pruned_loss=0.06578, over 21559.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2993, pruned_loss=0.07645, over 4260615.02 frames. ], batch size: 263, lr: 5.31e-03, grad_scale: 32.0 2023-06-23 18:59:20,301 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:59:21,202 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 2.599e+02 2.851e+02 3.513e+02 5.978e+02, threshold=5.701e+02, percent-clipped=2.0 2023-06-23 19:00:05,618 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.73 vs. limit=15.0 2023-06-23 19:00:22,719 INFO [train.py:996] (2/4) Epoch 6, batch 2450, loss[loss=0.2497, simple_loss=0.3277, pruned_loss=0.08588, over 21765.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3037, pruned_loss=0.07862, over 4260509.67 frames. ], batch size: 113, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:00:39,309 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=15.0 2023-06-23 19:00:54,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=929598.0, ans=0.0 2023-06-23 19:01:28,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=929718.0, ans=0.125 2023-06-23 19:02:01,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=929778.0, ans=0.2 2023-06-23 19:02:13,101 INFO [train.py:996] (2/4) Epoch 6, batch 2500, loss[loss=0.2226, simple_loss=0.3018, pruned_loss=0.07169, over 21550.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.304, pruned_loss=0.07936, over 4266757.97 frames. ], batch size: 414, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:02:13,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=929838.0, ans=0.125 2023-06-23 19:02:45,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=929898.0, ans=0.125 2023-06-23 19:03:03,173 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.544e+02 2.837e+02 3.478e+02 5.146e+02, threshold=5.674e+02, percent-clipped=0.0 2023-06-23 19:03:42,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=930078.0, ans=0.125 2023-06-23 19:04:04,689 INFO [train.py:996] (2/4) Epoch 6, batch 2550, loss[loss=0.2033, simple_loss=0.2836, pruned_loss=0.06148, over 21393.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3012, pruned_loss=0.07774, over 4271642.92 frames. 
], batch size: 131, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:04:11,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=930138.0, ans=0.0 2023-06-23 19:04:14,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=930138.0, ans=0.125 2023-06-23 19:04:38,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=930198.0, ans=0.2 2023-06-23 19:04:51,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=930258.0, ans=0.0 2023-06-23 19:05:57,812 INFO [train.py:996] (2/4) Epoch 6, batch 2600, loss[loss=0.1904, simple_loss=0.2636, pruned_loss=0.05857, over 21590.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3007, pruned_loss=0.07745, over 4263372.78 frames. ], batch size: 247, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:06:08,036 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.21 vs. limit=15.0 2023-06-23 19:06:45,800 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-23 19:06:47,831 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 2.627e+02 2.988e+02 3.634e+02 5.525e+02, threshold=5.976e+02, percent-clipped=0.0 2023-06-23 19:06:54,022 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.32 vs. limit=15.0 2023-06-23 19:06:59,315 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=15.0 2023-06-23 19:07:03,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=930618.0, ans=0.125 2023-06-23 19:07:31,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=930678.0, ans=0.0 2023-06-23 19:07:40,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=930678.0, ans=0.2 2023-06-23 19:07:40,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=930678.0, ans=0.0 2023-06-23 19:07:49,114 INFO [train.py:996] (2/4) Epoch 6, batch 2650, loss[loss=0.2368, simple_loss=0.3174, pruned_loss=0.07812, over 21473.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3019, pruned_loss=0.07731, over 4264170.87 frames. ], batch size: 131, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:07:51,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=930738.0, ans=0.0 2023-06-23 19:08:28,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=930798.0, ans=0.125 2023-06-23 19:08:49,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=930918.0, ans=0.2 2023-06-23 19:09:42,226 INFO [train.py:996] (2/4) Epoch 6, batch 2700, loss[loss=0.276, simple_loss=0.3451, pruned_loss=0.1035, over 21555.00 frames. 
], tot_loss[loss=0.228, simple_loss=0.3011, pruned_loss=0.07747, over 4255886.27 frames. ], batch size: 471, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:10:07,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=931098.0, ans=0.125 2023-06-23 19:10:12,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=931098.0, ans=0.125 2023-06-23 19:10:22,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=931158.0, ans=0.125 2023-06-23 19:10:28,190 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:10:32,950 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.187e+02 2.709e+02 3.074e+02 3.590e+02 5.374e+02, threshold=6.148e+02, percent-clipped=0.0 2023-06-23 19:10:48,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=931218.0, ans=10.0 2023-06-23 19:11:34,511 INFO [train.py:996] (2/4) Epoch 6, batch 2750, loss[loss=0.2523, simple_loss=0.3266, pruned_loss=0.08906, over 21693.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2998, pruned_loss=0.07732, over 4253674.59 frames. ], batch size: 441, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:11:53,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=931398.0, ans=0.1 2023-06-23 19:12:53,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=931518.0, ans=0.1 2023-06-23 19:13:24,221 INFO [train.py:996] (2/4) Epoch 6, batch 2800, loss[loss=0.2585, simple_loss=0.3398, pruned_loss=0.08864, over 21768.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.304, pruned_loss=0.07805, over 4258570.18 frames. ], batch size: 298, lr: 5.30e-03, grad_scale: 32.0 2023-06-23 19:13:43,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=931638.0, ans=0.125 2023-06-23 19:14:10,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=931758.0, ans=0.0 2023-06-23 19:14:22,177 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.716e+02 3.036e+02 3.413e+02 5.034e+02, threshold=6.071e+02, percent-clipped=0.0 2023-06-23 19:15:13,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=931878.0, ans=0.0 2023-06-23 19:15:18,179 INFO [train.py:996] (2/4) Epoch 6, batch 2850, loss[loss=0.201, simple_loss=0.2595, pruned_loss=0.07122, over 21273.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3077, pruned_loss=0.08006, over 4257014.24 frames. 
], batch size: 549, lr: 5.30e-03, grad_scale: 32.0 2023-06-23 19:15:40,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=931998.0, ans=0.07 2023-06-23 19:16:08,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=932058.0, ans=0.1 2023-06-23 19:16:08,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=932058.0, ans=0.125 2023-06-23 19:16:52,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=932178.0, ans=0.0 2023-06-23 19:16:59,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=932178.0, ans=15.0 2023-06-23 19:17:04,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=932178.0, ans=0.125 2023-06-23 19:17:07,581 INFO [train.py:996] (2/4) Epoch 6, batch 2900, loss[loss=0.2572, simple_loss=0.3214, pruned_loss=0.09648, over 22056.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3044, pruned_loss=0.0794, over 4266738.11 frames. ], batch size: 119, lr: 5.30e-03, grad_scale: 32.0 2023-06-23 19:17:17,597 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=12.0 2023-06-23 19:18:03,657 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.041e+02 2.630e+02 3.132e+02 3.824e+02 7.694e+02, threshold=6.265e+02, percent-clipped=2.0 2023-06-23 19:18:21,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=932418.0, ans=0.05 2023-06-23 19:18:50,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=932478.0, ans=0.125 2023-06-23 19:18:51,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=932478.0, ans=0.1 2023-06-23 19:18:58,083 INFO [train.py:996] (2/4) Epoch 6, batch 2950, loss[loss=0.2635, simple_loss=0.3486, pruned_loss=0.0892, over 21656.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3052, pruned_loss=0.07973, over 4271521.27 frames. ], batch size: 263, lr: 5.30e-03, grad_scale: 32.0 2023-06-23 19:19:18,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=932538.0, ans=0.125 2023-06-23 19:19:26,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=932598.0, ans=0.125 2023-06-23 19:19:27,231 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.00 vs. limit=15.0 2023-06-23 19:19:39,690 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.52 vs. 
limit=10.0 2023-06-23 19:20:04,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=932658.0, ans=0.0 2023-06-23 19:20:04,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=932658.0, ans=0.025 2023-06-23 19:20:50,757 INFO [train.py:996] (2/4) Epoch 6, batch 3000, loss[loss=0.2809, simple_loss=0.3629, pruned_loss=0.0994, over 21468.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3092, pruned_loss=0.07984, over 4275700.05 frames. ], batch size: 131, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:20:50,757 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-23 19:21:13,128 INFO [train.py:1028] (2/4) Epoch 6, validation: loss=0.2526, simple_loss=0.3435, pruned_loss=0.08085, over 1796401.00 frames. 2023-06-23 19:21:13,129 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-23 19:22:00,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=932898.0, ans=0.09899494936611666 2023-06-23 19:22:04,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=932958.0, ans=0.2 2023-06-23 19:22:14,695 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 2.535e+02 2.851e+02 3.436e+02 5.853e+02, threshold=5.702e+02, percent-clipped=0.0 2023-06-23 19:22:15,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=932958.0, ans=0.125 2023-06-23 19:23:04,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=933138.0, ans=0.0 2023-06-23 19:23:05,139 INFO [train.py:996] (2/4) Epoch 6, batch 3050, loss[loss=0.1802, simple_loss=0.2559, pruned_loss=0.05221, over 21455.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3099, pruned_loss=0.07903, over 4280367.64 frames. ], batch size: 194, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:24:53,946 INFO [train.py:996] (2/4) Epoch 6, batch 3100, loss[loss=0.2191, simple_loss=0.3076, pruned_loss=0.06527, over 21604.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3097, pruned_loss=0.07795, over 4284701.29 frames. ], batch size: 230, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:25:55,813 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.838e+02 2.716e+02 3.164e+02 3.740e+02 6.470e+02, threshold=6.328e+02, percent-clipped=4.0 2023-06-23 19:26:31,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=933678.0, ans=0.125 2023-06-23 19:26:33,989 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.64 vs. limit=10.0 2023-06-23 19:26:37,792 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-23 19:26:52,677 INFO [train.py:996] (2/4) Epoch 6, batch 3150, loss[loss=0.2398, simple_loss=0.3116, pruned_loss=0.08403, over 21594.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3104, pruned_loss=0.07787, over 4279294.21 frames. 
], batch size: 230, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:26:54,020 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.83 vs. limit=6.0 2023-06-23 19:27:13,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=933738.0, ans=0.0 2023-06-23 19:27:22,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=933798.0, ans=0.0 2023-06-23 19:27:53,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=933858.0, ans=0.0 2023-06-23 19:28:56,532 INFO [train.py:996] (2/4) Epoch 6, batch 3200, loss[loss=0.2644, simple_loss=0.3461, pruned_loss=0.09138, over 21756.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3123, pruned_loss=0.0786, over 4274732.39 frames. ], batch size: 441, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:29:20,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=934098.0, ans=0.1 2023-06-23 19:29:29,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=934098.0, ans=0.125 2023-06-23 19:29:46,297 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.879e+02 2.521e+02 2.818e+02 3.375e+02 4.819e+02, threshold=5.636e+02, percent-clipped=0.0 2023-06-23 19:30:24,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=934278.0, ans=0.1 2023-06-23 19:30:46,854 INFO [train.py:996] (2/4) Epoch 6, batch 3250, loss[loss=0.2172, simple_loss=0.279, pruned_loss=0.07768, over 21744.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3149, pruned_loss=0.08096, over 4280239.74 frames. ], batch size: 282, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:31:05,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=934398.0, ans=0.125 2023-06-23 19:31:09,869 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-06-23 19:31:38,438 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.83 vs. limit=22.5 2023-06-23 19:32:40,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=934638.0, ans=0.0 2023-06-23 19:32:41,478 INFO [train.py:996] (2/4) Epoch 6, batch 3300, loss[loss=0.2073, simple_loss=0.2896, pruned_loss=0.06255, over 21388.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3076, pruned_loss=0.0803, over 4279840.97 frames. 
], batch size: 194, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:32:48,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=934638.0, ans=0.0 2023-06-23 19:33:19,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=934698.0, ans=0.125 2023-06-23 19:33:26,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=934758.0, ans=0.125 2023-06-23 19:33:26,858 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-06-23 19:33:38,658 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.620e+02 2.941e+02 3.334e+02 7.153e+02, threshold=5.881e+02, percent-clipped=1.0 2023-06-23 19:33:45,116 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=22.5 2023-06-23 19:34:26,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=934878.0, ans=0.125 2023-06-23 19:34:27,288 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.10 vs. limit=10.0 2023-06-23 19:34:33,186 INFO [train.py:996] (2/4) Epoch 6, batch 3350, loss[loss=0.2138, simple_loss=0.2874, pruned_loss=0.07007, over 21836.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3101, pruned_loss=0.08025, over 4277330.69 frames. ], batch size: 247, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:35:00,076 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.58 vs. limit=6.0 2023-06-23 19:35:09,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=934998.0, ans=0.125 2023-06-23 19:35:13,665 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.34 vs. limit=15.0 2023-06-23 19:35:29,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=935058.0, ans=0.125 2023-06-23 19:35:29,758 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-06-23 19:35:37,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=935118.0, ans=0.2 2023-06-23 19:36:03,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=935118.0, ans=0.125 2023-06-23 19:36:25,752 INFO [train.py:996] (2/4) Epoch 6, batch 3400, loss[loss=0.2096, simple_loss=0.2761, pruned_loss=0.07152, over 21244.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3093, pruned_loss=0.08117, over 4275477.99 frames. 
], batch size: 176, lr: 5.29e-03, grad_scale: 16.0 2023-06-23 19:37:08,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=935298.0, ans=0.0 2023-06-23 19:37:15,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=935358.0, ans=0.5 2023-06-23 19:37:16,287 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.67 vs. limit=15.0 2023-06-23 19:37:19,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=935358.0, ans=0.2 2023-06-23 19:37:30,209 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 2.631e+02 2.892e+02 3.496e+02 6.427e+02, threshold=5.784e+02, percent-clipped=1.0 2023-06-23 19:37:47,877 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.55 vs. limit=15.0 2023-06-23 19:37:52,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=935418.0, ans=0.2 2023-06-23 19:38:04,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=935478.0, ans=0.0 2023-06-23 19:38:08,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=935478.0, ans=0.125 2023-06-23 19:38:16,300 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.18 vs. limit=12.0 2023-06-23 19:38:18,970 INFO [train.py:996] (2/4) Epoch 6, batch 3450, loss[loss=0.2128, simple_loss=0.2814, pruned_loss=0.0721, over 21941.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3056, pruned_loss=0.08003, over 4275076.45 frames. ], batch size: 113, lr: 5.29e-03, grad_scale: 16.0 2023-06-23 19:39:32,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=935658.0, ans=0.125 2023-06-23 19:40:06,203 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=15.0 2023-06-23 19:40:16,024 INFO [train.py:996] (2/4) Epoch 6, batch 3500, loss[loss=0.2482, simple_loss=0.3242, pruned_loss=0.08611, over 21277.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3113, pruned_loss=0.08288, over 4257123.75 frames. ], batch size: 143, lr: 5.29e-03, grad_scale: 16.0 2023-06-23 19:40:57,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.whiten.whitening_limit, batch_count=935898.0, ans=12.0 2023-06-23 19:41:07,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=935958.0, ans=0.0 2023-06-23 19:41:16,946 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 2.777e+02 3.098e+02 3.671e+02 6.397e+02, threshold=6.196e+02, percent-clipped=1.0 2023-06-23 19:41:35,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=936018.0, ans=0.2 2023-06-23 19:41:44,974 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.31 vs. 
limit=15.0 2023-06-23 19:42:08,077 INFO [train.py:996] (2/4) Epoch 6, batch 3550, loss[loss=0.209, simple_loss=0.2836, pruned_loss=0.06716, over 21726.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3141, pruned_loss=0.08415, over 4259500.85 frames. ], batch size: 351, lr: 5.29e-03, grad_scale: 16.0 2023-06-23 19:42:49,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=936198.0, ans=0.125 2023-06-23 19:42:59,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=936258.0, ans=0.125 2023-06-23 19:43:21,750 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.27 vs. limit=12.0 2023-06-23 19:43:33,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=936378.0, ans=0.125 2023-06-23 19:43:51,990 INFO [train.py:996] (2/4) Epoch 6, batch 3600, loss[loss=0.2434, simple_loss=0.3086, pruned_loss=0.08914, over 21676.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3101, pruned_loss=0.08344, over 4262537.68 frames. ], batch size: 351, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 19:43:52,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=936438.0, ans=0.0 2023-06-23 19:45:00,201 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.636e+02 3.056e+02 3.547e+02 6.528e+02, threshold=6.113e+02, percent-clipped=1.0 2023-06-23 19:45:47,705 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.75 vs. limit=15.0 2023-06-23 19:45:48,081 INFO [train.py:996] (2/4) Epoch 6, batch 3650, loss[loss=0.2356, simple_loss=0.3209, pruned_loss=0.07518, over 21657.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3113, pruned_loss=0.08352, over 4270857.24 frames. ], batch size: 389, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 19:46:38,406 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.55 vs. limit=15.0 2023-06-23 19:47:36,913 INFO [train.py:996] (2/4) Epoch 6, batch 3700, loss[loss=0.29, simple_loss=0.3687, pruned_loss=0.1057, over 21361.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3112, pruned_loss=0.08294, over 4272275.10 frames. ], batch size: 549, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 19:47:44,742 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=12.0 2023-06-23 19:47:53,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=937038.0, ans=0.2 2023-06-23 19:48:38,104 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.081e+02 2.573e+02 2.941e+02 3.537e+02 5.018e+02, threshold=5.882e+02, percent-clipped=0.0 2023-06-23 19:49:03,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=937278.0, ans=0.1 2023-06-23 19:49:06,465 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.69 vs. 
limit=15.0 2023-06-23 19:49:12,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=937278.0, ans=0.1 2023-06-23 19:49:24,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=937278.0, ans=0.0 2023-06-23 19:49:27,043 INFO [train.py:996] (2/4) Epoch 6, batch 3750, loss[loss=0.2044, simple_loss=0.2854, pruned_loss=0.06173, over 21849.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3096, pruned_loss=0.08243, over 4280114.28 frames. ], batch size: 351, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 19:50:35,872 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=15.0 2023-06-23 19:50:37,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=937518.0, ans=0.125 2023-06-23 19:51:14,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=937578.0, ans=0.2 2023-06-23 19:51:23,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=937578.0, ans=0.0 2023-06-23 19:51:29,603 INFO [train.py:996] (2/4) Epoch 6, batch 3800, loss[loss=0.2118, simple_loss=0.288, pruned_loss=0.06781, over 21785.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3067, pruned_loss=0.08013, over 4279334.17 frames. ], batch size: 247, lr: 5.28e-03, grad_scale: 16.0 2023-06-23 19:51:46,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=937638.0, ans=0.125 2023-06-23 19:51:56,660 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=22.5 2023-06-23 19:51:57,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=937698.0, ans=0.125 2023-06-23 19:52:10,893 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-23 19:52:17,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=937758.0, ans=0.125 2023-06-23 19:52:21,710 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 2.479e+02 2.831e+02 3.335e+02 6.491e+02, threshold=5.662e+02, percent-clipped=1.0 2023-06-23 19:52:29,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=937818.0, ans=0.125 2023-06-23 19:53:20,122 INFO [train.py:996] (2/4) Epoch 6, batch 3850, loss[loss=0.3023, simple_loss=0.4047, pruned_loss=0.09996, over 19980.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3058, pruned_loss=0.08056, over 4281624.53 frames. 
], batch size: 702, lr: 5.28e-03, grad_scale: 16.0 2023-06-23 19:53:34,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=937938.0, ans=10.0 2023-06-23 19:53:36,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=937938.0, ans=0.09899494936611666 2023-06-23 19:53:37,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=937938.0, ans=0.125 2023-06-23 19:53:41,816 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-23 19:53:45,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=937998.0, ans=0.125 2023-06-23 19:53:55,731 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.19 vs. limit=15.0 2023-06-23 19:55:09,884 INFO [train.py:996] (2/4) Epoch 6, batch 3900, loss[loss=0.2392, simple_loss=0.308, pruned_loss=0.08524, over 21827.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3026, pruned_loss=0.08041, over 4276466.08 frames. ], batch size: 107, lr: 5.28e-03, grad_scale: 16.0 2023-06-23 19:55:21,994 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.72 vs. limit=15.0 2023-06-23 19:55:32,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=938298.0, ans=0.1 2023-06-23 19:55:37,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=938298.0, ans=0.125 2023-06-23 19:55:56,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=938358.0, ans=0.125 2023-06-23 19:56:02,986 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.187e+02 2.781e+02 3.101e+02 3.883e+02 8.958e+02, threshold=6.202e+02, percent-clipped=3.0 2023-06-23 19:56:29,148 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=15.0 2023-06-23 19:57:01,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=938478.0, ans=0.2 2023-06-23 19:57:06,059 INFO [train.py:996] (2/4) Epoch 6, batch 3950, loss[loss=0.2142, simple_loss=0.2586, pruned_loss=0.08491, over 20169.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3038, pruned_loss=0.08009, over 4277115.97 frames. 
], batch size: 703, lr: 5.28e-03, grad_scale: 16.0 2023-06-23 19:57:10,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=938538.0, ans=0.2 2023-06-23 19:57:27,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=938598.0, ans=0.0 2023-06-23 19:57:45,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=938658.0, ans=0.2 2023-06-23 19:58:09,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=938718.0, ans=0.125 2023-06-23 19:58:56,511 INFO [train.py:996] (2/4) Epoch 6, batch 4000, loss[loss=0.2438, simple_loss=0.308, pruned_loss=0.08985, over 20613.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2965, pruned_loss=0.07649, over 4276450.54 frames. ], batch size: 607, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 19:59:13,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.out_whiten.whitening_limit, batch_count=938898.0, ans=8.0 2023-06-23 19:59:16,731 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:59:30,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=938958.0, ans=0.2 2023-06-23 19:59:44,418 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.407e+02 2.711e+02 3.233e+02 5.039e+02, threshold=5.423e+02, percent-clipped=0.0 2023-06-23 20:00:47,404 INFO [train.py:996] (2/4) Epoch 6, batch 4050, loss[loss=0.2542, simple_loss=0.3119, pruned_loss=0.09825, over 21549.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2962, pruned_loss=0.0752, over 4277835.21 frames. ], batch size: 441, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 20:00:53,835 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.25 vs. limit=15.0 2023-06-23 20:01:16,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=939198.0, ans=0.1 2023-06-23 20:02:32,582 INFO [train.py:996] (2/4) Epoch 6, batch 4100, loss[loss=0.2088, simple_loss=0.2948, pruned_loss=0.06135, over 21607.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2977, pruned_loss=0.07418, over 4275130.96 frames. ], batch size: 230, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 20:03:19,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=939558.0, ans=0.2 2023-06-23 20:03:26,711 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.792e+02 2.413e+02 2.658e+02 3.099e+02 5.779e+02, threshold=5.316e+02, percent-clipped=1.0 2023-06-23 20:03:27,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=939558.0, ans=0.125 2023-06-23 20:04:14,287 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.13 vs. limit=15.0 2023-06-23 20:04:18,598 INFO [train.py:996] (2/4) Epoch 6, batch 4150, loss[loss=0.1687, simple_loss=0.2675, pruned_loss=0.03497, over 21542.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2967, pruned_loss=0.0714, over 4280337.06 frames. 
], batch size: 212, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 20:04:24,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=939738.0, ans=0.125 2023-06-23 20:04:32,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=939738.0, ans=0.07 2023-06-23 20:04:36,190 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.41 vs. limit=22.5 2023-06-23 20:04:40,164 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.98 vs. limit=6.0 2023-06-23 20:05:10,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=939858.0, ans=0.125 2023-06-23 20:05:18,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=939858.0, ans=0.125 2023-06-23 20:05:47,623 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.79 vs. limit=10.0 2023-06-23 20:05:52,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=939978.0, ans=0.125 2023-06-23 20:06:12,067 INFO [train.py:996] (2/4) Epoch 6, batch 4200, loss[loss=0.1961, simple_loss=0.275, pruned_loss=0.05857, over 21559.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2966, pruned_loss=0.07133, over 4275458.06 frames. ], batch size: 195, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:07:04,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=940158.0, ans=0.125 2023-06-23 20:07:12,843 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.61 vs. limit=15.0 2023-06-23 20:07:18,280 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.688e+02 2.286e+02 2.656e+02 3.507e+02 6.693e+02, threshold=5.313e+02, percent-clipped=3.0 2023-06-23 20:07:38,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=940218.0, ans=0.125 2023-06-23 20:08:05,643 INFO [train.py:996] (2/4) Epoch 6, batch 4250, loss[loss=0.2631, simple_loss=0.3395, pruned_loss=0.0933, over 21718.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3048, pruned_loss=0.07394, over 4266317.45 frames. ], batch size: 351, lr: 5.27e-03, grad_scale: 16.0 2023-06-23 20:08:27,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=940338.0, ans=0.0 2023-06-23 20:08:50,825 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=12.0 2023-06-23 20:09:26,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=940518.0, ans=0.2 2023-06-23 20:09:33,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=940518.0, ans=0.125 2023-06-23 20:09:59,081 INFO [train.py:996] (2/4) Epoch 6, batch 4300, loss[loss=0.2536, simple_loss=0.3524, pruned_loss=0.07742, over 21637.00 frames. 
], tot_loss[loss=0.2321, simple_loss=0.3118, pruned_loss=0.07615, over 4273642.90 frames. ], batch size: 414, lr: 5.27e-03, grad_scale: 16.0 2023-06-23 20:10:31,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=940698.0, ans=0.1 2023-06-23 20:10:59,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=940758.0, ans=0.1 2023-06-23 20:11:11,076 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.724e+02 3.223e+02 4.213e+02 6.998e+02, threshold=6.446e+02, percent-clipped=6.0 2023-06-23 20:11:15,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=940818.0, ans=0.0 2023-06-23 20:11:22,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=940818.0, ans=0.04949747468305833 2023-06-23 20:11:48,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=940878.0, ans=0.0 2023-06-23 20:12:00,258 INFO [train.py:996] (2/4) Epoch 6, batch 4350, loss[loss=0.2186, simple_loss=0.2864, pruned_loss=0.07541, over 21885.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3093, pruned_loss=0.07515, over 4274436.60 frames. ], batch size: 107, lr: 5.27e-03, grad_scale: 16.0 2023-06-23 20:12:13,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=940938.0, ans=0.125 2023-06-23 20:12:46,732 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.17 vs. limit=10.0 2023-06-23 20:13:04,440 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-06-23 20:13:52,005 INFO [train.py:996] (2/4) Epoch 6, batch 4400, loss[loss=0.203, simple_loss=0.2711, pruned_loss=0.06744, over 21200.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3047, pruned_loss=0.07486, over 4275327.50 frames. ], batch size: 548, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:14:10,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=941238.0, ans=0.125 2023-06-23 20:14:53,321 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.950e+02 2.531e+02 2.865e+02 3.462e+02 7.210e+02, threshold=5.730e+02, percent-clipped=2.0 2023-06-23 20:15:43,319 INFO [train.py:996] (2/4) Epoch 6, batch 4450, loss[loss=0.3444, simple_loss=0.432, pruned_loss=0.1284, over 21523.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3128, pruned_loss=0.07651, over 4271875.86 frames. ], batch size: 471, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:16:02,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=941538.0, ans=0.125 2023-06-23 20:16:47,175 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.09 vs. 
limit=15.0 2023-06-23 20:16:50,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=941718.0, ans=0.1 2023-06-23 20:17:37,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=941838.0, ans=0.1 2023-06-23 20:17:38,984 INFO [train.py:996] (2/4) Epoch 6, batch 4500, loss[loss=0.2301, simple_loss=0.3187, pruned_loss=0.07077, over 21682.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.314, pruned_loss=0.07878, over 4281118.55 frames. ], batch size: 263, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:18:18,096 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.94 vs. limit=15.0 2023-06-23 20:18:32,700 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.439e+02 2.793e+02 3.421e+02 5.110e+02, threshold=5.586e+02, percent-clipped=0.0 2023-06-23 20:18:57,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=942018.0, ans=0.125 2023-06-23 20:19:33,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=942138.0, ans=0.1 2023-06-23 20:19:34,562 INFO [train.py:996] (2/4) Epoch 6, batch 4550, loss[loss=0.3244, simple_loss=0.3762, pruned_loss=0.1363, over 21321.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3173, pruned_loss=0.07933, over 4279542.64 frames. ], batch size: 507, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:19:50,132 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-23 20:20:41,296 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.26 vs. limit=15.0 2023-06-23 20:20:42,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=942318.0, ans=10.0 2023-06-23 20:21:25,304 INFO [train.py:996] (2/4) Epoch 6, batch 4600, loss[loss=0.2437, simple_loss=0.3325, pruned_loss=0.07745, over 21611.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.32, pruned_loss=0.08112, over 4282358.45 frames. 
], batch size: 414, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:21:43,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=942498.0, ans=0.07 2023-06-23 20:21:50,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=942498.0, ans=0.04949747468305833 2023-06-23 20:21:55,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=942498.0, ans=0.125 2023-06-23 20:22:25,592 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.585e+02 3.169e+02 3.580e+02 7.815e+02, threshold=6.337e+02, percent-clipped=3.0 2023-06-23 20:22:34,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=942618.0, ans=0.0 2023-06-23 20:22:35,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=942618.0, ans=0.2 2023-06-23 20:23:13,752 INFO [train.py:996] (2/4) Epoch 6, batch 4650, loss[loss=0.1721, simple_loss=0.248, pruned_loss=0.04813, over 21753.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.314, pruned_loss=0.07981, over 4287682.12 frames. ], batch size: 298, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:24:02,784 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.73 vs. limit=15.0 2023-06-23 20:24:14,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=942858.0, ans=0.125 2023-06-23 20:24:55,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=942978.0, ans=0.2 2023-06-23 20:25:03,305 INFO [train.py:996] (2/4) Epoch 6, batch 4700, loss[loss=0.2114, simple_loss=0.2695, pruned_loss=0.07666, over 21265.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3038, pruned_loss=0.07678, over 4283115.26 frames. ], batch size: 144, lr: 5.27e-03, grad_scale: 16.0 2023-06-23 20:25:26,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=943098.0, ans=0.125 2023-06-23 20:26:04,154 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.777e+02 2.385e+02 2.698e+02 3.095e+02 5.090e+02, threshold=5.395e+02, percent-clipped=0.0 2023-06-23 20:26:27,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=943278.0, ans=0.0 2023-06-23 20:26:27,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=943278.0, ans=0.125 2023-06-23 20:26:48,307 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.20 vs. limit=6.0 2023-06-23 20:26:50,591 INFO [train.py:996] (2/4) Epoch 6, batch 4750, loss[loss=0.2367, simple_loss=0.3003, pruned_loss=0.08657, over 21859.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2977, pruned_loss=0.07605, over 4286411.85 frames. ], batch size: 351, lr: 5.27e-03, grad_scale: 16.0 2023-06-23 20:28:31,719 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.32 vs. 
limit=15.0 2023-06-23 20:28:39,517 INFO [train.py:996] (2/4) Epoch 6, batch 4800, loss[loss=0.2404, simple_loss=0.3457, pruned_loss=0.06751, over 21682.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2985, pruned_loss=0.07708, over 4293797.51 frames. ], batch size: 414, lr: 5.26e-03, grad_scale: 32.0 2023-06-23 20:29:07,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=943698.0, ans=0.2 2023-06-23 20:29:21,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=943698.0, ans=0.1 2023-06-23 20:29:42,949 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 2.734e+02 3.125e+02 3.511e+02 5.007e+02, threshold=6.249e+02, percent-clipped=0.0 2023-06-23 20:30:12,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=943878.0, ans=0.0 2023-06-23 20:30:15,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=943878.0, ans=0.125 2023-06-23 20:30:27,086 INFO [train.py:996] (2/4) Epoch 6, batch 4850, loss[loss=0.2593, simple_loss=0.3253, pruned_loss=0.0966, over 21667.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2984, pruned_loss=0.07662, over 4294377.84 frames. ], batch size: 441, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:30:31,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=943938.0, ans=0.0 2023-06-23 20:30:43,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=943998.0, ans=0.04949747468305833 2023-06-23 20:31:32,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=944118.0, ans=0.125 2023-06-23 20:31:37,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=944118.0, ans=0.125 2023-06-23 20:32:14,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=944178.0, ans=0.125 2023-06-23 20:32:17,553 INFO [train.py:996] (2/4) Epoch 6, batch 4900, loss[loss=0.2402, simple_loss=0.3143, pruned_loss=0.08307, over 21305.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.2989, pruned_loss=0.07713, over 4281495.84 frames. ], batch size: 159, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:32:32,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=944238.0, ans=0.0 2023-06-23 20:33:01,056 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.66 vs. 
limit=15.0 2023-06-23 20:33:10,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=944358.0, ans=0.125 2023-06-23 20:33:28,630 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.472e+02 2.764e+02 3.016e+02 5.453e+02, threshold=5.528e+02, percent-clipped=0.0 2023-06-23 20:33:50,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=944478.0, ans=0.125 2023-06-23 20:34:09,898 INFO [train.py:996] (2/4) Epoch 6, batch 4950, loss[loss=0.1938, simple_loss=0.2923, pruned_loss=0.04765, over 21611.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3038, pruned_loss=0.076, over 4275634.90 frames. ], batch size: 389, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:35:07,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=944658.0, ans=0.1 2023-06-23 20:35:25,332 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-23 20:35:33,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=944718.0, ans=0.125 2023-06-23 20:35:57,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=944838.0, ans=15.0 2023-06-23 20:35:58,236 INFO [train.py:996] (2/4) Epoch 6, batch 5000, loss[loss=0.2233, simple_loss=0.2976, pruned_loss=0.0745, over 21469.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3035, pruned_loss=0.07272, over 4270519.34 frames. ], batch size: 211, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:35:59,402 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.75 vs. limit=10.0 2023-06-23 20:36:37,677 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.85 vs. limit=15.0 2023-06-23 20:36:39,856 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.89 vs. limit=15.0 2023-06-23 20:36:48,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=944958.0, ans=0.125 2023-06-23 20:37:01,263 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.831e+02 2.469e+02 2.951e+02 3.464e+02 5.172e+02, threshold=5.903e+02, percent-clipped=0.0 2023-06-23 20:37:02,084 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:37:40,310 INFO [train.py:996] (2/4) Epoch 6, batch 5050, loss[loss=0.2211, simple_loss=0.2912, pruned_loss=0.07545, over 21559.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3032, pruned_loss=0.07473, over 4276475.53 frames. ], batch size: 212, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:38:33,306 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.98 vs. limit=15.0 2023-06-23 20:38:45,882 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.53 vs. 
limit=22.5 2023-06-23 20:38:58,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=945318.0, ans=0.125 2023-06-23 20:39:18,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=945378.0, ans=0.125 2023-06-23 20:39:26,500 INFO [train.py:996] (2/4) Epoch 6, batch 5100, loss[loss=0.2564, simple_loss=0.3635, pruned_loss=0.07463, over 19796.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3021, pruned_loss=0.07421, over 4278022.66 frames. ], batch size: 703, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:39:52,418 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=12.96 vs. limit=15.0 2023-06-23 20:40:03,186 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-23 20:40:30,255 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.074e+02 2.802e+02 3.209e+02 3.785e+02 5.711e+02, threshold=6.418e+02, percent-clipped=0.0 2023-06-23 20:40:44,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=945618.0, ans=0.1 2023-06-23 20:41:07,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=945678.0, ans=0.125 2023-06-23 20:41:15,813 INFO [train.py:996] (2/4) Epoch 6, batch 5150, loss[loss=0.2315, simple_loss=0.3112, pruned_loss=0.07589, over 21838.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3006, pruned_loss=0.07524, over 4286780.05 frames. ], batch size: 332, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:41:16,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=945738.0, ans=0.125 2023-06-23 20:41:50,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=945798.0, ans=0.125 2023-06-23 20:41:55,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=945798.0, ans=0.125 2023-06-23 20:41:58,067 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.89 vs. limit=12.0 2023-06-23 20:42:30,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=945918.0, ans=0.2 2023-06-23 20:42:41,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=945918.0, ans=0.125 2023-06-23 20:42:45,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=945918.0, ans=0.1 2023-06-23 20:43:05,740 INFO [train.py:996] (2/4) Epoch 6, batch 5200, loss[loss=0.2248, simple_loss=0.3114, pruned_loss=0.06912, over 21410.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3048, pruned_loss=0.0771, over 4287085.41 frames. 
], batch size: 194, lr: 5.26e-03, grad_scale: 32.0 2023-06-23 20:44:14,592 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.657e+02 3.031e+02 3.767e+02 5.750e+02, threshold=6.062e+02, percent-clipped=0.0 2023-06-23 20:44:26,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=946218.0, ans=0.2 2023-06-23 20:44:55,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=946278.0, ans=0.125 2023-06-23 20:44:59,546 INFO [train.py:996] (2/4) Epoch 6, batch 5250, loss[loss=0.2433, simple_loss=0.3322, pruned_loss=0.07723, over 21740.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3084, pruned_loss=0.07571, over 4272470.02 frames. ], batch size: 414, lr: 5.26e-03, grad_scale: 32.0 2023-06-23 20:45:51,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=946458.0, ans=0.125 2023-06-23 20:46:52,855 INFO [train.py:996] (2/4) Epoch 6, batch 5300, loss[loss=0.2028, simple_loss=0.3061, pruned_loss=0.04976, over 19726.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.307, pruned_loss=0.07533, over 4266400.64 frames. ], batch size: 703, lr: 5.26e-03, grad_scale: 32.0 2023-06-23 20:47:25,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=946698.0, ans=0.125 2023-06-23 20:47:49,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=946818.0, ans=0.025 2023-06-23 20:47:55,269 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.539e+02 2.781e+02 3.236e+02 4.836e+02, threshold=5.563e+02, percent-clipped=0.0 2023-06-23 20:47:56,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=946818.0, ans=0.2 2023-06-23 20:48:01,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=946818.0, ans=0.015 2023-06-23 20:48:02,097 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.75 vs. limit=15.0 2023-06-23 20:48:29,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=946878.0, ans=0.125 2023-06-23 20:48:35,179 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:48:41,807 INFO [train.py:996] (2/4) Epoch 6, batch 5350, loss[loss=0.2204, simple_loss=0.2941, pruned_loss=0.07333, over 21552.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3044, pruned_loss=0.07611, over 4279172.22 frames. ], batch size: 131, lr: 5.26e-03, grad_scale: 32.0 2023-06-23 20:48:58,789 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.40 vs. limit=15.0 2023-06-23 20:50:00,269 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.91 vs. 
limit=12.0 2023-06-23 20:50:27,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=947178.0, ans=0.125 2023-06-23 20:50:29,934 INFO [train.py:996] (2/4) Epoch 6, batch 5400, loss[loss=0.2327, simple_loss=0.307, pruned_loss=0.07919, over 21734.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.303, pruned_loss=0.07698, over 4285895.57 frames. ], batch size: 441, lr: 5.25e-03, grad_scale: 16.0 2023-06-23 20:50:55,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=947298.0, ans=0.125 2023-06-23 20:51:07,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=947298.0, ans=0.1 2023-06-23 20:51:18,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=947358.0, ans=0.2 2023-06-23 20:51:23,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=947358.0, ans=0.0 2023-06-23 20:51:34,439 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 2.654e+02 3.257e+02 3.898e+02 6.722e+02, threshold=6.513e+02, percent-clipped=2.0 2023-06-23 20:51:42,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=947418.0, ans=0.125 2023-06-23 20:51:53,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=947418.0, ans=0.0 2023-06-23 20:52:16,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=947478.0, ans=0.125 2023-06-23 20:52:19,550 INFO [train.py:996] (2/4) Epoch 6, batch 5450, loss[loss=0.213, simple_loss=0.2955, pruned_loss=0.06523, over 21177.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3026, pruned_loss=0.07537, over 4283614.62 frames. ], batch size: 143, lr: 5.25e-03, grad_scale: 16.0 2023-06-23 20:52:57,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=947598.0, ans=0.125 2023-06-23 20:53:45,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=947718.0, ans=0.1 2023-06-23 20:54:09,234 INFO [train.py:996] (2/4) Epoch 6, batch 5500, loss[loss=0.2089, simple_loss=0.3082, pruned_loss=0.05481, over 21706.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3078, pruned_loss=0.0731, over 4287807.81 frames. ], batch size: 298, lr: 5.25e-03, grad_scale: 16.0 2023-06-23 20:55:24,832 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.705e+02 2.255e+02 2.654e+02 3.007e+02 4.668e+02, threshold=5.308e+02, percent-clipped=0.0 2023-06-23 20:55:56,417 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.76 vs. limit=12.0 2023-06-23 20:56:04,065 INFO [train.py:996] (2/4) Epoch 6, batch 5550, loss[loss=0.2306, simple_loss=0.3281, pruned_loss=0.06654, over 21434.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3059, pruned_loss=0.07007, over 4277136.42 frames. 
], batch size: 471, lr: 5.25e-03, grad_scale: 16.0 2023-06-23 20:56:12,594 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.61 vs. limit=15.0 2023-06-23 20:56:35,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=948198.0, ans=0.1 2023-06-23 20:56:35,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=948198.0, ans=0.125 2023-06-23 20:56:57,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=948258.0, ans=0.2 2023-06-23 20:57:00,138 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.68 vs. limit=10.0 2023-06-23 20:57:56,429 INFO [train.py:996] (2/4) Epoch 6, batch 5600, loss[loss=0.3327, simple_loss=0.4164, pruned_loss=0.1245, over 21523.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.3045, pruned_loss=0.06744, over 4280969.24 frames. ], batch size: 471, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 20:58:21,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=948498.0, ans=0.125 2023-06-23 20:58:58,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=948618.0, ans=0.125 2023-06-23 20:59:01,222 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 2.332e+02 2.800e+02 3.364e+02 5.770e+02, threshold=5.601e+02, percent-clipped=3.0 2023-06-23 20:59:10,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=948618.0, ans=0.2 2023-06-23 20:59:44,403 INFO [train.py:996] (2/4) Epoch 6, batch 5650, loss[loss=0.2168, simple_loss=0.2935, pruned_loss=0.07009, over 21872.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3084, pruned_loss=0.07032, over 4280444.41 frames. ], batch size: 298, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 20:59:47,467 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.00 vs. limit=10.0 2023-06-23 20:59:49,330 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-23 21:01:28,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=949038.0, ans=0.125 2023-06-23 21:01:29,430 INFO [train.py:996] (2/4) Epoch 6, batch 5700, loss[loss=0.2477, simple_loss=0.3348, pruned_loss=0.08026, over 21562.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.31, pruned_loss=0.07238, over 4282998.36 frames. 
], batch size: 441, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 21:01:58,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=949098.0, ans=0.2 2023-06-23 21:02:28,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=949158.0, ans=0.1 2023-06-23 21:02:41,755 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.515e+02 2.975e+02 3.453e+02 5.794e+02, threshold=5.950e+02, percent-clipped=1.0 2023-06-23 21:03:00,991 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.68 vs. limit=15.0 2023-06-23 21:03:31,995 INFO [train.py:996] (2/4) Epoch 6, batch 5750, loss[loss=0.1481, simple_loss=0.2283, pruned_loss=0.03398, over 21359.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3066, pruned_loss=0.07011, over 4280678.55 frames. ], batch size: 176, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 21:05:18,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=949578.0, ans=0.125 2023-06-23 21:05:22,450 INFO [train.py:996] (2/4) Epoch 6, batch 5800, loss[loss=0.2405, simple_loss=0.3389, pruned_loss=0.07105, over 21815.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.3044, pruned_loss=0.06863, over 4274806.68 frames. ], batch size: 371, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 21:06:27,700 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.707e+02 2.304e+02 2.799e+02 4.068e+02 6.558e+02, threshold=5.598e+02, percent-clipped=2.0 2023-06-23 21:07:12,468 INFO [train.py:996] (2/4) Epoch 6, batch 5850, loss[loss=0.1718, simple_loss=0.2684, pruned_loss=0.03758, over 21421.00 frames. ], tot_loss[loss=0.215, simple_loss=0.3011, pruned_loss=0.06441, over 4276204.55 frames. ], batch size: 211, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 21:07:16,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=949938.0, ans=0.0 2023-06-23 21:08:52,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=950178.0, ans=0.125 2023-06-23 21:08:55,271 INFO [train.py:996] (2/4) Epoch 6, batch 5900, loss[loss=0.2025, simple_loss=0.2824, pruned_loss=0.06132, over 21892.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2938, pruned_loss=0.05971, over 4277207.52 frames. ], batch size: 371, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 21:09:26,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=950298.0, ans=0.0 2023-06-23 21:09:39,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=950358.0, ans=0.125 2023-06-23 21:09:57,809 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.988e+02 2.407e+02 3.041e+02 4.833e+02, threshold=4.814e+02, percent-clipped=0.0 2023-06-23 21:10:20,778 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-23 21:10:41,832 INFO [train.py:996] (2/4) Epoch 6, batch 5950, loss[loss=0.2085, simple_loss=0.274, pruned_loss=0.0715, over 21408.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2924, pruned_loss=0.0629, over 4278837.14 frames. 
], batch size: 194, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 21:11:21,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=950598.0, ans=0.0 2023-06-23 21:11:28,852 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.17 vs. limit=8.0 2023-06-23 21:12:08,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=950718.0, ans=0.0 2023-06-23 21:12:15,764 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.35 vs. limit=15.0 2023-06-23 21:12:30,051 INFO [train.py:996] (2/4) Epoch 6, batch 6000, loss[loss=0.2013, simple_loss=0.2622, pruned_loss=0.07017, over 21270.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2885, pruned_loss=0.06622, over 4265754.51 frames. ], batch size: 176, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:12:30,052 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-23 21:12:53,048 INFO [train.py:1028] (2/4) Epoch 6, validation: loss=0.2596, simple_loss=0.3528, pruned_loss=0.08322, over 1796401.00 frames. 2023-06-23 21:12:53,049 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-23 21:12:55,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=950838.0, ans=0.125 2023-06-23 21:13:16,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=950898.0, ans=0.0 2023-06-23 21:14:01,810 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-23 21:14:03,939 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 2.620e+02 2.865e+02 3.269e+02 5.211e+02, threshold=5.729e+02, percent-clipped=1.0 2023-06-23 21:14:14,033 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.88 vs. limit=12.0 2023-06-23 21:14:27,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=951078.0, ans=0.1 2023-06-23 21:14:38,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=951078.0, ans=0.05 2023-06-23 21:14:48,479 INFO [train.py:996] (2/4) Epoch 6, batch 6050, loss[loss=0.1943, simple_loss=0.2571, pruned_loss=0.06576, over 21616.00 frames. ], tot_loss[loss=0.209, simple_loss=0.284, pruned_loss=0.067, over 4262869.82 frames. ], batch size: 298, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:15:15,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=951198.0, ans=0.125 2023-06-23 21:16:08,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=951378.0, ans=0.125 2023-06-23 21:16:30,438 INFO [train.py:996] (2/4) Epoch 6, batch 6100, loss[loss=0.2268, simple_loss=0.3048, pruned_loss=0.07439, over 21712.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2825, pruned_loss=0.06575, over 4264412.34 frames. 
], batch size: 389, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:16:54,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=951498.0, ans=0.125 2023-06-23 21:17:40,857 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 2.204e+02 2.422e+02 2.717e+02 3.811e+02, threshold=4.844e+02, percent-clipped=0.0 2023-06-23 21:17:55,768 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.65 vs. limit=15.0 2023-06-23 21:17:56,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=951678.0, ans=0.2 2023-06-23 21:18:18,535 INFO [train.py:996] (2/4) Epoch 6, batch 6150, loss[loss=0.214, simple_loss=0.2909, pruned_loss=0.06854, over 21750.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.286, pruned_loss=0.06846, over 4276599.56 frames. ], batch size: 124, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:18:19,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=951738.0, ans=0.125 2023-06-23 21:18:36,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=951798.0, ans=0.125 2023-06-23 21:18:41,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=951798.0, ans=0.0 2023-06-23 21:18:59,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=951858.0, ans=0.2 2023-06-23 21:19:10,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=951858.0, ans=0.125 2023-06-23 21:19:17,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=951858.0, ans=0.0 2023-06-23 21:19:19,803 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.44 vs. limit=15.0 2023-06-23 21:20:08,104 INFO [train.py:996] (2/4) Epoch 6, batch 6200, loss[loss=0.264, simple_loss=0.3435, pruned_loss=0.09221, over 21494.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2901, pruned_loss=0.06891, over 4277210.61 frames. ], batch size: 548, lr: 5.24e-03, grad_scale: 16.0 2023-06-23 21:20:14,569 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=12.0 2023-06-23 21:20:25,398 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.16 vs. limit=10.0 2023-06-23 21:20:48,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=952098.0, ans=0.125 2023-06-23 21:20:56,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=952158.0, ans=0.1 2023-06-23 21:21:04,557 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.62 vs. 
limit=22.5 2023-06-23 21:21:15,541 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.781e+02 2.446e+02 2.781e+02 3.201e+02 6.151e+02, threshold=5.562e+02, percent-clipped=2.0 2023-06-23 21:21:53,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=952278.0, ans=0.125 2023-06-23 21:21:58,217 INFO [train.py:996] (2/4) Epoch 6, batch 6250, loss[loss=0.2439, simple_loss=0.3491, pruned_loss=0.06937, over 21769.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2968, pruned_loss=0.06885, over 4282814.25 frames. ], batch size: 332, lr: 5.24e-03, grad_scale: 16.0 2023-06-23 21:22:52,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=952458.0, ans=0.125 2023-06-23 21:23:28,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=952578.0, ans=0.125 2023-06-23 21:23:45,367 INFO [train.py:996] (2/4) Epoch 6, batch 6300, loss[loss=0.2355, simple_loss=0.3084, pruned_loss=0.08133, over 21932.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.299, pruned_loss=0.06801, over 4284841.95 frames. ], batch size: 113, lr: 5.24e-03, grad_scale: 16.0 2023-06-23 21:23:46,639 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.10 vs. limit=15.0 2023-06-23 21:23:49,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=952638.0, ans=0.125 2023-06-23 21:23:58,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=952638.0, ans=0.125 2023-06-23 21:24:48,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=952818.0, ans=0.125 2023-06-23 21:24:57,677 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.792e+02 2.558e+02 3.046e+02 3.782e+02 6.709e+02, threshold=6.092e+02, percent-clipped=4.0 2023-06-23 21:25:07,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=952818.0, ans=0.04949747468305833 2023-06-23 21:25:23,792 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=15.0 2023-06-23 21:25:34,564 INFO [train.py:996] (2/4) Epoch 6, batch 6350, loss[loss=0.249, simple_loss=0.3147, pruned_loss=0.09167, over 21622.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3037, pruned_loss=0.07244, over 4291227.45 frames. 
], batch size: 263, lr: 5.24e-03, grad_scale: 16.0 2023-06-23 21:26:05,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=952998.0, ans=0.125 2023-06-23 21:26:17,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=952998.0, ans=0.125 2023-06-23 21:26:47,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=953118.0, ans=0.125 2023-06-23 21:26:49,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=953118.0, ans=0.125 2023-06-23 21:26:56,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=953118.0, ans=0.125 2023-06-23 21:26:59,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=953118.0, ans=0.125 2023-06-23 21:27:01,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=953118.0, ans=0.1 2023-06-23 21:27:05,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=953178.0, ans=0.125 2023-06-23 21:27:29,910 INFO [train.py:996] (2/4) Epoch 6, batch 6400, loss[loss=0.2391, simple_loss=0.3085, pruned_loss=0.08487, over 21361.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3092, pruned_loss=0.07686, over 4294084.91 frames. ], batch size: 548, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:27:52,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=953298.0, ans=0.07 2023-06-23 21:28:42,598 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.766e+02 2.997e+02 3.346e+02 4.721e+02, threshold=5.994e+02, percent-clipped=0.0 2023-06-23 21:28:58,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=953478.0, ans=0.125 2023-06-23 21:29:24,309 INFO [train.py:996] (2/4) Epoch 6, batch 6450, loss[loss=0.1907, simple_loss=0.2612, pruned_loss=0.06009, over 21829.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3116, pruned_loss=0.07603, over 4295780.67 frames. ], batch size: 107, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:29:26,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=953538.0, ans=0.125 2023-06-23 21:30:35,700 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. limit=6.0 2023-06-23 21:31:13,665 INFO [train.py:996] (2/4) Epoch 6, batch 6500, loss[loss=0.2651, simple_loss=0.3707, pruned_loss=0.0798, over 19747.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3065, pruned_loss=0.07435, over 4281172.50 frames. 
], batch size: 703, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:31:21,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=953838.0, ans=0.0 2023-06-23 21:32:05,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=953958.0, ans=0.2 2023-06-23 21:32:18,650 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.973e+02 2.470e+02 2.695e+02 2.978e+02 5.209e+02, threshold=5.391e+02, percent-clipped=0.0 2023-06-23 21:32:39,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=954078.0, ans=0.0 2023-06-23 21:32:52,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.whiten.whitening_limit, batch_count=954078.0, ans=12.0 2023-06-23 21:32:56,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=954078.0, ans=0.015 2023-06-23 21:33:01,257 INFO [train.py:996] (2/4) Epoch 6, batch 6550, loss[loss=0.2236, simple_loss=0.2984, pruned_loss=0.07433, over 21707.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3039, pruned_loss=0.07291, over 4267928.03 frames. ], batch size: 389, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:33:25,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=954198.0, ans=0.0 2023-06-23 21:33:46,402 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=22.5 2023-06-23 21:33:47,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=954258.0, ans=0.125 2023-06-23 21:34:34,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=954378.0, ans=0.125 2023-06-23 21:34:46,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=954438.0, ans=0.125 2023-06-23 21:34:47,739 INFO [train.py:996] (2/4) Epoch 6, batch 6600, loss[loss=0.2096, simple_loss=0.2715, pruned_loss=0.0739, over 21563.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2977, pruned_loss=0.07209, over 4273299.91 frames. ], batch size: 391, lr: 5.23e-03, grad_scale: 8.0 2023-06-23 21:34:58,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=954438.0, ans=0.125 2023-06-23 21:35:49,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=954618.0, ans=0.0 2023-06-23 21:36:01,796 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.786e+02 2.286e+02 2.575e+02 2.928e+02 5.219e+02, threshold=5.150e+02, percent-clipped=0.0 2023-06-23 21:36:35,425 INFO [train.py:996] (2/4) Epoch 6, batch 6650, loss[loss=0.1839, simple_loss=0.2536, pruned_loss=0.05713, over 21596.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2892, pruned_loss=0.06974, over 4273869.94 frames. 
], batch size: 247, lr: 5.23e-03, grad_scale: 8.0 2023-06-23 21:36:36,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=954738.0, ans=0.125 2023-06-23 21:36:42,115 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=15.0 2023-06-23 21:37:21,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=954858.0, ans=0.125 2023-06-23 21:37:28,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=954858.0, ans=0.125 2023-06-23 21:37:30,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=954858.0, ans=0.125 2023-06-23 21:38:08,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=954978.0, ans=6.0 2023-06-23 21:38:12,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=954978.0, ans=0.125 2023-06-23 21:38:18,826 INFO [train.py:996] (2/4) Epoch 6, batch 6700, loss[loss=0.2065, simple_loss=0.2786, pruned_loss=0.06716, over 21551.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2833, pruned_loss=0.0698, over 4272105.07 frames. ], batch size: 230, lr: 5.23e-03, grad_scale: 8.0 2023-06-23 21:38:39,248 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.77 vs. limit=22.5 2023-06-23 21:39:34,441 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.289e+02 2.607e+02 3.016e+02 4.316e+02, threshold=5.215e+02, percent-clipped=0.0 2023-06-23 21:40:07,903 INFO [train.py:996] (2/4) Epoch 6, batch 6750, loss[loss=0.2105, simple_loss=0.2809, pruned_loss=0.07004, over 21782.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2818, pruned_loss=0.07023, over 4278402.00 frames. ], batch size: 112, lr: 5.23e-03, grad_scale: 8.0 2023-06-23 21:40:09,191 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.97 vs. limit=22.5 2023-06-23 21:40:09,198 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-23 21:40:10,134 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 21:41:09,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=955518.0, ans=0.125 2023-06-23 21:41:55,004 INFO [train.py:996] (2/4) Epoch 6, batch 6800, loss[loss=0.2365, simple_loss=0.2961, pruned_loss=0.08848, over 21743.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2861, pruned_loss=0.0738, over 4275996.11 frames. ], batch size: 112, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:41:59,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=955638.0, ans=0.2 2023-06-23 21:42:02,818 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.79 vs. 
limit=15.0 2023-06-23 21:42:36,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=955698.0, ans=0.125 2023-06-23 21:43:03,407 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.510e+02 2.967e+02 3.494e+02 5.351e+02, threshold=5.935e+02, percent-clipped=1.0 2023-06-23 21:43:42,665 INFO [train.py:996] (2/4) Epoch 6, batch 6850, loss[loss=0.2181, simple_loss=0.2853, pruned_loss=0.07546, over 21895.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2849, pruned_loss=0.07428, over 4278071.01 frames. ], batch size: 316, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:43:44,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=955938.0, ans=0.0 2023-06-23 21:44:19,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=955998.0, ans=0.125 2023-06-23 21:44:21,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=955998.0, ans=0.0 2023-06-23 21:44:26,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=956058.0, ans=0.04949747468305833 2023-06-23 21:44:51,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=956118.0, ans=0.04949747468305833 2023-06-23 21:45:32,168 INFO [train.py:996] (2/4) Epoch 6, batch 6900, loss[loss=0.1894, simple_loss=0.2721, pruned_loss=0.05341, over 21234.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2858, pruned_loss=0.07414, over 4279977.20 frames. ], batch size: 159, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:45:36,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=956238.0, ans=0.2 2023-06-23 21:46:19,997 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.69 vs. limit=12.0 2023-06-23 21:46:45,106 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 21:46:49,840 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.733e+02 2.526e+02 2.937e+02 3.629e+02 5.523e+02, threshold=5.874e+02, percent-clipped=0.0 2023-06-23 21:46:59,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=956478.0, ans=0.0 2023-06-23 21:46:59,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=956478.0, ans=0.125 2023-06-23 21:47:27,769 INFO [train.py:996] (2/4) Epoch 6, batch 6950, loss[loss=0.2047, simple_loss=0.2684, pruned_loss=0.07052, over 21272.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2892, pruned_loss=0.0708, over 4282046.08 frames. ], batch size: 143, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:48:46,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=956778.0, ans=0.2 2023-06-23 21:49:14,670 INFO [train.py:996] (2/4) Epoch 6, batch 7000, loss[loss=0.2143, simple_loss=0.2897, pruned_loss=0.06947, over 15400.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.293, pruned_loss=0.07362, over 4264843.30 frames. 
], batch size: 61, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:49:15,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=956838.0, ans=0.2 2023-06-23 21:49:15,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=956838.0, ans=0.125 2023-06-23 21:50:27,289 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.073e+02 2.602e+02 2.936e+02 3.362e+02 6.122e+02, threshold=5.872e+02, percent-clipped=1.0 2023-06-23 21:51:05,576 INFO [train.py:996] (2/4) Epoch 6, batch 7050, loss[loss=0.1903, simple_loss=0.284, pruned_loss=0.04833, over 21833.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2908, pruned_loss=0.07251, over 4263443.51 frames. ], batch size: 371, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:51:06,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=957138.0, ans=0.125 2023-06-23 21:51:25,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=957198.0, ans=0.1 2023-06-23 21:51:43,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=957258.0, ans=0.07 2023-06-23 21:52:06,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=957318.0, ans=0.125 2023-06-23 21:52:15,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=957318.0, ans=0.05 2023-06-23 21:52:31,234 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2023-06-23 21:52:39,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=957378.0, ans=0.0 2023-06-23 21:52:41,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=957378.0, ans=0.125 2023-06-23 21:52:49,897 INFO [train.py:996] (2/4) Epoch 6, batch 7100, loss[loss=0.2897, simple_loss=0.3503, pruned_loss=0.1146, over 21457.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2962, pruned_loss=0.07563, over 4258140.06 frames. 
], batch size: 509, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:52:52,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=957438.0, ans=0.125 2023-06-23 21:53:09,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=957498.0, ans=0.125 2023-06-23 21:53:42,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=957558.0, ans=10.0 2023-06-23 21:54:06,832 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 2.381e+02 2.673e+02 3.454e+02 5.437e+02, threshold=5.346e+02, percent-clipped=0.0 2023-06-23 21:54:23,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=957678.0, ans=0.0 2023-06-23 21:54:32,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=957678.0, ans=0.0 2023-06-23 21:54:35,286 INFO [train.py:996] (2/4) Epoch 6, batch 7150, loss[loss=0.2307, simple_loss=0.3247, pruned_loss=0.06838, over 16663.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2919, pruned_loss=0.07224, over 4249969.12 frames. ], batch size: 60, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:54:41,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=957738.0, ans=0.125 2023-06-23 21:54:53,682 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 21:55:17,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=957798.0, ans=0.125 2023-06-23 21:55:31,756 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.10 vs. limit=22.5 2023-06-23 21:55:33,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=957858.0, ans=0.1 2023-06-23 21:55:35,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=957858.0, ans=0.0 2023-06-23 21:56:09,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=957978.0, ans=0.125 2023-06-23 21:56:24,646 INFO [train.py:996] (2/4) Epoch 6, batch 7200, loss[loss=0.2265, simple_loss=0.2899, pruned_loss=0.08156, over 21416.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2955, pruned_loss=0.07506, over 4261993.31 frames. ], batch size: 131, lr: 5.23e-03, grad_scale: 32.0 2023-06-23 21:57:06,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=958098.0, ans=0.0 2023-06-23 21:57:43,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=958218.0, ans=6.0 2023-06-23 21:57:46,113 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.971e+02 2.518e+02 2.883e+02 3.559e+02 6.632e+02, threshold=5.766e+02, percent-clipped=3.0 2023-06-23 21:58:13,494 INFO [train.py:996] (2/4) Epoch 6, batch 7250, loss[loss=0.2252, simple_loss=0.2955, pruned_loss=0.07745, over 21798.00 frames. 
], tot_loss[loss=0.2202, simple_loss=0.2912, pruned_loss=0.07459, over 4267092.42 frames. ], batch size: 107, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 21:58:29,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=958398.0, ans=0.125 2023-06-23 21:59:37,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=958518.0, ans=0.0 2023-06-23 21:59:40,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=958518.0, ans=0.125 2023-06-23 22:00:02,000 INFO [train.py:996] (2/4) Epoch 6, batch 7300, loss[loss=0.1885, simple_loss=0.2556, pruned_loss=0.06065, over 21575.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2855, pruned_loss=0.07345, over 4269992.64 frames. ], batch size: 247, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 22:00:46,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=958758.0, ans=0.07 2023-06-23 22:01:24,356 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.461e+02 2.779e+02 3.106e+02 5.760e+02, threshold=5.558e+02, percent-clipped=0.0 2023-06-23 22:01:41,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=958878.0, ans=0.025 2023-06-23 22:01:51,246 INFO [train.py:996] (2/4) Epoch 6, batch 7350, loss[loss=0.2732, simple_loss=0.3513, pruned_loss=0.0975, over 21860.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2842, pruned_loss=0.07339, over 4261629.26 frames. ], batch size: 124, lr: 5.22e-03, grad_scale: 16.0 2023-06-23 22:02:33,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=958998.0, ans=0.2 2023-06-23 22:02:44,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=959058.0, ans=0.2 2023-06-23 22:03:37,936 INFO [train.py:996] (2/4) Epoch 6, batch 7400, loss[loss=0.2184, simple_loss=0.3162, pruned_loss=0.0603, over 21827.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2899, pruned_loss=0.07558, over 4269952.93 frames. ], batch size: 372, lr: 5.22e-03, grad_scale: 16.0 2023-06-23 22:04:46,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=959358.0, ans=0.05 2023-06-23 22:05:00,958 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.692e+02 3.073e+02 3.719e+02 6.060e+02, threshold=6.147e+02, percent-clipped=2.0 2023-06-23 22:05:01,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=959418.0, ans=0.0 2023-06-23 22:05:05,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=959418.0, ans=0.1 2023-06-23 22:05:39,086 INFO [train.py:996] (2/4) Epoch 6, batch 7450, loss[loss=0.214, simple_loss=0.2796, pruned_loss=0.0742, over 21624.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2888, pruned_loss=0.07468, over 4270921.84 frames. 
], batch size: 298, lr: 5.22e-03, grad_scale: 16.0 2023-06-23 22:05:43,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=959538.0, ans=0.1 2023-06-23 22:05:55,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=959598.0, ans=0.125 2023-06-23 22:06:19,309 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.50 vs. limit=15.0 2023-06-23 22:06:38,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=959658.0, ans=0.2 2023-06-23 22:06:55,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=959718.0, ans=0.125 2023-06-23 22:07:30,964 INFO [train.py:996] (2/4) Epoch 6, batch 7500, loss[loss=0.2423, simple_loss=0.3363, pruned_loss=0.07416, over 21443.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2934, pruned_loss=0.07652, over 4270623.51 frames. ], batch size: 211, lr: 5.22e-03, grad_scale: 16.0 2023-06-23 22:08:17,655 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-06-23 22:08:44,093 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.824e+02 3.431e+02 4.118e+02 7.261e+02, threshold=6.863e+02, percent-clipped=3.0 2023-06-23 22:08:44,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=960018.0, ans=10.0 2023-06-23 22:09:20,932 INFO [train.py:996] (2/4) Epoch 6, batch 7550, loss[loss=0.1565, simple_loss=0.233, pruned_loss=0.04004, over 15842.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2985, pruned_loss=0.07489, over 4266324.97 frames. ], batch size: 62, lr: 5.22e-03, grad_scale: 16.0 2023-06-23 22:09:21,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=960138.0, ans=0.125 2023-06-23 22:09:45,839 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-23 22:10:00,561 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:10:20,202 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0 2023-06-23 22:10:59,994 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:11:08,251 INFO [train.py:996] (2/4) Epoch 6, batch 7600, loss[loss=0.2221, simple_loss=0.286, pruned_loss=0.07908, over 21584.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2992, pruned_loss=0.07437, over 4271234.11 frames. 
], batch size: 548, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 22:11:26,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=960438.0, ans=0.0 2023-06-23 22:11:36,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=960498.0, ans=0.0 2023-06-23 22:11:38,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=960498.0, ans=0.0 2023-06-23 22:11:41,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=960498.0, ans=0.125 2023-06-23 22:12:11,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=960618.0, ans=0.09899494936611666 2023-06-23 22:12:18,811 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.780e+02 2.489e+02 2.806e+02 3.400e+02 5.423e+02, threshold=5.611e+02, percent-clipped=0.0 2023-06-23 22:12:55,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=960738.0, ans=0.125 2023-06-23 22:12:56,181 INFO [train.py:996] (2/4) Epoch 6, batch 7650, loss[loss=0.2464, simple_loss=0.3413, pruned_loss=0.0757, over 20087.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2979, pruned_loss=0.07518, over 4276350.81 frames. ], batch size: 703, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 22:13:08,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=960738.0, ans=0.125 2023-06-23 22:13:17,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=960738.0, ans=0.125 2023-06-23 22:13:38,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=960798.0, ans=10.0 2023-06-23 22:14:04,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=960918.0, ans=0.05 2023-06-23 22:14:16,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=960918.0, ans=0.0 2023-06-23 22:14:27,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=960978.0, ans=0.035 2023-06-23 22:14:35,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=960978.0, ans=0.2 2023-06-23 22:14:51,534 INFO [train.py:996] (2/4) Epoch 6, batch 7700, loss[loss=0.2679, simple_loss=0.3383, pruned_loss=0.09878, over 21622.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.302, pruned_loss=0.07855, over 4279192.25 frames. ], batch size: 389, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 22:16:06,419 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.621e+02 3.080e+02 3.761e+02 5.045e+02, threshold=6.161e+02, percent-clipped=0.0 2023-06-23 22:16:16,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=961278.0, ans=0.2 2023-06-23 22:16:43,630 INFO [train.py:996] (2/4) Epoch 6, batch 7750, loss[loss=0.2551, simple_loss=0.3543, pruned_loss=0.07793, over 21762.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3089, pruned_loss=0.07897, over 4277776.60 frames. 
], batch size: 282, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 22:17:35,371 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2023-06-23 22:18:17,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=961578.0, ans=0.1 2023-06-23 22:18:26,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=961578.0, ans=0.05 2023-06-23 22:18:32,333 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.08 vs. limit=15.0 2023-06-23 22:18:34,458 INFO [train.py:996] (2/4) Epoch 6, batch 7800, loss[loss=0.2316, simple_loss=0.3059, pruned_loss=0.07868, over 21761.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3135, pruned_loss=0.08088, over 4275136.43 frames. ], batch size: 282, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 22:18:38,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=961638.0, ans=0.125 2023-06-23 22:18:50,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=961698.0, ans=0.2 2023-06-23 22:18:53,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=961698.0, ans=0.125 2023-06-23 22:18:55,536 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:19:45,759 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.173e+02 2.845e+02 3.471e+02 4.135e+02 7.731e+02, threshold=6.941e+02, percent-clipped=4.0 2023-06-23 22:20:08,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=961878.0, ans=0.025 2023-06-23 22:20:12,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=961878.0, ans=0.0 2023-06-23 22:20:21,464 INFO [train.py:996] (2/4) Epoch 6, batch 7850, loss[loss=0.2038, simple_loss=0.2685, pruned_loss=0.06956, over 21616.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3045, pruned_loss=0.07948, over 4263205.78 frames. ], batch size: 298, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:20:43,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=961998.0, ans=0.125 2023-06-23 22:21:30,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=962118.0, ans=0.125 2023-06-23 22:22:15,689 INFO [train.py:996] (2/4) Epoch 6, batch 7900, loss[loss=0.1958, simple_loss=0.2613, pruned_loss=0.06518, over 21143.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.2992, pruned_loss=0.07803, over 4254980.05 frames. 
], batch size: 143, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:22:58,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=962358.0, ans=0.2 2023-06-23 22:23:01,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=962358.0, ans=0.2 2023-06-23 22:23:36,793 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 2.814e+02 3.173e+02 3.712e+02 6.452e+02, threshold=6.346e+02, percent-clipped=0.0 2023-06-23 22:24:03,344 INFO [train.py:996] (2/4) Epoch 6, batch 7950, loss[loss=0.2628, simple_loss=0.3408, pruned_loss=0.09243, over 21750.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3028, pruned_loss=0.07712, over 4256890.40 frames. ], batch size: 441, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:24:07,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=962538.0, ans=0.2 2023-06-23 22:24:26,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=962598.0, ans=0.125 2023-06-23 22:24:32,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=962598.0, ans=0.2 2023-06-23 22:25:12,408 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:25:20,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=962718.0, ans=0.125 2023-06-23 22:25:20,790 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.36 vs. limit=15.0 2023-06-23 22:25:27,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=962718.0, ans=0.125 2023-06-23 22:25:42,473 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:25:49,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=962778.0, ans=0.0 2023-06-23 22:25:51,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=962838.0, ans=0.02 2023-06-23 22:25:58,612 INFO [train.py:996] (2/4) Epoch 6, batch 8000, loss[loss=0.2475, simple_loss=0.3221, pruned_loss=0.08647, over 21445.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3072, pruned_loss=0.07905, over 4250965.10 frames. 
], batch size: 194, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:27:20,913 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.094e+02 2.660e+02 3.200e+02 3.986e+02 6.358e+02, threshold=6.400e+02, percent-clipped=1.0 2023-06-23 22:27:21,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=963018.0, ans=0.125 2023-06-23 22:27:23,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=963018.0, ans=0.125 2023-06-23 22:27:49,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=963078.0, ans=0.0 2023-06-23 22:27:59,904 INFO [train.py:996] (2/4) Epoch 6, batch 8050, loss[loss=0.2694, simple_loss=0.3566, pruned_loss=0.09114, over 21692.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3133, pruned_loss=0.07987, over 4258715.89 frames. ], batch size: 389, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:28:28,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=963198.0, ans=0.125 2023-06-23 22:29:11,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=963318.0, ans=0.1 2023-06-23 22:29:48,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=963378.0, ans=0.1 2023-06-23 22:29:51,734 INFO [train.py:996] (2/4) Epoch 6, batch 8100, loss[loss=0.2446, simple_loss=0.31, pruned_loss=0.08961, over 21538.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3107, pruned_loss=0.08035, over 4264046.73 frames. ], batch size: 548, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:30:07,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=963438.0, ans=0.2 2023-06-23 22:31:23,066 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.199e+02 2.897e+02 3.319e+02 4.086e+02 8.225e+02, threshold=6.637e+02, percent-clipped=1.0 2023-06-23 22:31:58,995 INFO [train.py:996] (2/4) Epoch 6, batch 8150, loss[loss=0.2191, simple_loss=0.3103, pruned_loss=0.06391, over 21597.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3187, pruned_loss=0.08336, over 4258187.03 frames. 
], batch size: 263, lr: 5.21e-03, grad_scale: 16.0 2023-06-23 22:32:25,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=963798.0, ans=0.0 2023-06-23 22:32:49,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=963858.0, ans=0.1 2023-06-23 22:32:51,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=963858.0, ans=0.125 2023-06-23 22:33:26,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=963978.0, ans=0.05 2023-06-23 22:33:28,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=963978.0, ans=0.2 2023-06-23 22:33:29,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=963978.0, ans=0.0 2023-06-23 22:33:44,025 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.17 vs. limit=15.0 2023-06-23 22:33:48,098 INFO [train.py:996] (2/4) Epoch 6, batch 8200, loss[loss=0.2108, simple_loss=0.2695, pruned_loss=0.07603, over 21135.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3098, pruned_loss=0.07954, over 4260605.17 frames. ], batch size: 159, lr: 5.21e-03, grad_scale: 16.0 2023-06-23 22:33:51,322 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=22.5 2023-06-23 22:34:15,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=964098.0, ans=0.2 2023-06-23 22:34:50,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=964158.0, ans=0.125 2023-06-23 22:34:55,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=964218.0, ans=0.1 2023-06-23 22:35:09,640 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.474e+02 2.845e+02 3.510e+02 6.334e+02, threshold=5.689e+02, percent-clipped=0.0 2023-06-23 22:35:10,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=964218.0, ans=0.2 2023-06-23 22:35:39,838 INFO [train.py:996] (2/4) Epoch 6, batch 8250, loss[loss=0.2003, simple_loss=0.2979, pruned_loss=0.05131, over 19924.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3088, pruned_loss=0.07969, over 4268444.63 frames. ], batch size: 703, lr: 5.21e-03, grad_scale: 16.0 2023-06-23 22:37:17,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=964578.0, ans=0.1 2023-06-23 22:37:21,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=964578.0, ans=0.125 2023-06-23 22:37:30,667 INFO [train.py:996] (2/4) Epoch 6, batch 8300, loss[loss=0.2537, simple_loss=0.335, pruned_loss=0.08616, over 21639.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3073, pruned_loss=0.07664, over 4265998.47 frames. 
], batch size: 389, lr: 5.21e-03, grad_scale: 16.0 2023-06-23 22:37:35,572 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0 2023-06-23 22:37:52,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=964698.0, ans=0.0 2023-06-23 22:38:08,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=964698.0, ans=0.125 2023-06-23 22:38:49,169 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.813e+02 2.358e+02 2.866e+02 3.291e+02 6.256e+02, threshold=5.732e+02, percent-clipped=2.0 2023-06-23 22:39:02,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=964878.0, ans=0.2 2023-06-23 22:39:04,538 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.01 vs. limit=15.0 2023-06-23 22:39:19,231 INFO [train.py:996] (2/4) Epoch 6, batch 8350, loss[loss=0.2058, simple_loss=0.2873, pruned_loss=0.06219, over 21540.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3066, pruned_loss=0.07501, over 4272805.94 frames. ], batch size: 212, lr: 5.21e-03, grad_scale: 16.0 2023-06-23 22:40:17,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=965058.0, ans=0.1 2023-06-23 22:40:31,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=965118.0, ans=0.125 2023-06-23 22:41:08,735 INFO [train.py:996] (2/4) Epoch 6, batch 8400, loss[loss=0.1741, simple_loss=0.2554, pruned_loss=0.04635, over 21199.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3047, pruned_loss=0.07266, over 4273076.23 frames. ], batch size: 143, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:41:45,881 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:42:16,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=965418.0, ans=0.0 2023-06-23 22:42:24,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=965418.0, ans=0.125 2023-06-23 22:42:27,776 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.585e+02 2.294e+02 2.573e+02 3.024e+02 4.553e+02, threshold=5.145e+02, percent-clipped=0.0 2023-06-23 22:42:52,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=965478.0, ans=0.0 2023-06-23 22:42:55,783 INFO [train.py:996] (2/4) Epoch 6, batch 8450, loss[loss=0.232, simple_loss=0.3019, pruned_loss=0.08101, over 21866.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3021, pruned_loss=0.07206, over 4277288.60 frames. ], batch size: 118, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:43:58,236 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.50 vs. limit=6.0 2023-06-23 22:44:10,554 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.11 vs. 
limit=15.0 2023-06-23 22:44:17,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=965718.0, ans=0.125 2023-06-23 22:44:19,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=965718.0, ans=0.125 2023-06-23 22:44:38,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=965778.0, ans=0.125 2023-06-23 22:44:44,911 INFO [train.py:996] (2/4) Epoch 6, batch 8500, loss[loss=0.1989, simple_loss=0.2532, pruned_loss=0.07227, over 21265.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2976, pruned_loss=0.07296, over 4280791.40 frames. ], batch size: 548, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:45:52,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=965958.0, ans=0.1 2023-06-23 22:45:54,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=966018.0, ans=0.0 2023-06-23 22:46:12,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=966018.0, ans=0.0 2023-06-23 22:46:13,512 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.818e+02 2.833e+02 3.387e+02 4.039e+02 6.147e+02, threshold=6.774e+02, percent-clipped=7.0 2023-06-23 22:46:36,917 INFO [train.py:996] (2/4) Epoch 6, batch 8550, loss[loss=0.2609, simple_loss=0.3522, pruned_loss=0.08477, over 21766.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3012, pruned_loss=0.07544, over 4273045.98 frames. ], batch size: 351, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:47:30,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=966258.0, ans=0.125 2023-06-23 22:47:31,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=966258.0, ans=0.0 2023-06-23 22:48:22,199 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-23 22:48:34,789 INFO [train.py:996] (2/4) Epoch 6, batch 8600, loss[loss=0.2424, simple_loss=0.3198, pruned_loss=0.0825, over 21601.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3092, pruned_loss=0.07763, over 4273102.95 frames. ], batch size: 263, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:49:02,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=966498.0, ans=0.125 2023-06-23 22:49:53,363 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.89 vs. limit=6.0 2023-06-23 22:49:57,424 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 2.875e+02 3.260e+02 4.247e+02 6.190e+02, threshold=6.520e+02, percent-clipped=0.0 2023-06-23 22:49:59,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=966618.0, ans=0.1 2023-06-23 22:50:21,012 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.21 vs. 
limit=22.5 2023-06-23 22:50:31,095 INFO [train.py:996] (2/4) Epoch 6, batch 8650, loss[loss=0.2523, simple_loss=0.3442, pruned_loss=0.08025, over 21537.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3159, pruned_loss=0.07789, over 4278993.85 frames. ], batch size: 471, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:51:05,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=966798.0, ans=0.1 2023-06-23 22:52:13,222 INFO [train.py:996] (2/4) Epoch 6, batch 8700, loss[loss=0.197, simple_loss=0.2635, pruned_loss=0.06521, over 21647.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3073, pruned_loss=0.07459, over 4277923.65 frames. ], batch size: 282, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:52:41,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=967098.0, ans=0.125 2023-06-23 22:53:03,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=967158.0, ans=0.0 2023-06-23 22:53:33,446 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 2.283e+02 2.590e+02 3.172e+02 4.476e+02, threshold=5.179e+02, percent-clipped=0.0 2023-06-23 22:53:35,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=967218.0, ans=0.125 2023-06-23 22:53:47,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=967278.0, ans=0.0 2023-06-23 22:54:08,947 INFO [train.py:996] (2/4) Epoch 6, batch 8750, loss[loss=0.2517, simple_loss=0.3596, pruned_loss=0.07184, over 20829.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3041, pruned_loss=0.07554, over 4279757.19 frames. ], batch size: 608, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:54:24,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=967338.0, ans=0.2 2023-06-23 22:54:49,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=967458.0, ans=0.07 2023-06-23 22:54:54,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=967458.0, ans=0.0 2023-06-23 22:55:12,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=967518.0, ans=0.0 2023-06-23 22:55:17,694 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:55:32,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=967518.0, ans=0.0 2023-06-23 22:55:34,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=967518.0, ans=0.125 2023-06-23 22:55:34,990 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:55:55,858 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.91 vs. limit=10.0 2023-06-23 22:56:02,189 INFO [train.py:996] (2/4) Epoch 6, batch 8800, loss[loss=0.2749, simple_loss=0.3703, pruned_loss=0.08976, over 19945.00 frames. 
], tot_loss[loss=0.2338, simple_loss=0.3118, pruned_loss=0.07792, over 4285215.40 frames. ], batch size: 702, lr: 5.20e-03, grad_scale: 32.0 2023-06-23 22:56:03,592 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.61 vs. limit=22.5 2023-06-23 22:56:04,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=967638.0, ans=0.1 2023-06-23 22:56:39,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=967698.0, ans=0.125 2023-06-23 22:56:55,943 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=12.0 2023-06-23 22:57:13,725 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.46 vs. limit=10.0 2023-06-23 22:57:28,543 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.027e+02 2.723e+02 3.088e+02 3.591e+02 5.183e+02, threshold=6.177e+02, percent-clipped=1.0 2023-06-23 22:57:32,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=967878.0, ans=0.2 2023-06-23 22:57:56,317 INFO [train.py:996] (2/4) Epoch 6, batch 8850, loss[loss=0.2403, simple_loss=0.3351, pruned_loss=0.07277, over 21630.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.318, pruned_loss=0.0794, over 4286350.14 frames. ], batch size: 389, lr: 5.20e-03, grad_scale: 32.0 2023-06-23 22:59:46,052 INFO [train.py:996] (2/4) Epoch 6, batch 8900, loss[loss=0.2168, simple_loss=0.277, pruned_loss=0.07829, over 21308.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3126, pruned_loss=0.07863, over 4288922.86 frames. ], batch size: 177, lr: 5.20e-03, grad_scale: 32.0 2023-06-23 22:59:59,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=968238.0, ans=0.1 2023-06-23 23:00:23,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=968298.0, ans=0.0 2023-06-23 23:01:18,317 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.656e+02 3.141e+02 3.730e+02 7.900e+02, threshold=6.282e+02, percent-clipped=6.0 2023-06-23 23:01:28,285 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. limit=6.0 2023-06-23 23:01:36,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=968478.0, ans=0.1 2023-06-23 23:01:39,338 INFO [train.py:996] (2/4) Epoch 6, batch 8950, loss[loss=0.2098, simple_loss=0.2722, pruned_loss=0.07367, over 21395.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3101, pruned_loss=0.07839, over 4277306.14 frames. ], batch size: 194, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 23:03:29,127 INFO [train.py:996] (2/4) Epoch 6, batch 9000, loss[loss=0.1887, simple_loss=0.2513, pruned_loss=0.063, over 21620.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3042, pruned_loss=0.07806, over 4274833.65 frames. 
], batch size: 282, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 23:03:29,127 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-23 23:03:39,824 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.2.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.0737, 2.5912, 4.1211, 3.1288], device='cuda:2') 2023-06-23 23:03:48,717 INFO [train.py:1028] (2/4) Epoch 6, validation: loss=0.2652, simple_loss=0.3551, pruned_loss=0.08764, over 1796401.00 frames. 2023-06-23 23:03:48,718 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-23 23:04:07,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=968838.0, ans=0.125 2023-06-23 23:04:11,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=968898.0, ans=0.125 2023-06-23 23:04:53,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=968958.0, ans=0.2 2023-06-23 23:05:12,832 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.595e+02 2.551e+02 3.018e+02 3.495e+02 6.048e+02, threshold=6.037e+02, percent-clipped=0.0 2023-06-23 23:05:17,902 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-06-23 23:05:45,390 INFO [train.py:996] (2/4) Epoch 6, batch 9050, loss[loss=0.2307, simple_loss=0.3107, pruned_loss=0.07531, over 21283.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3006, pruned_loss=0.07513, over 4278224.11 frames. ], batch size: 549, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 23:06:37,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=969258.0, ans=0.1 2023-06-23 23:06:40,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=969258.0, ans=0.125 2023-06-23 23:07:36,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=969378.0, ans=0.0 2023-06-23 23:07:39,272 INFO [train.py:996] (2/4) Epoch 6, batch 9100, loss[loss=0.2474, simple_loss=0.3392, pruned_loss=0.07777, over 21639.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.308, pruned_loss=0.07744, over 4275665.54 frames. ], batch size: 441, lr: 5.19e-03, grad_scale: 16.0 2023-06-23 23:07:39,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=969438.0, ans=10.0 2023-06-23 23:07:54,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=969438.0, ans=0.125 2023-06-23 23:08:09,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=969498.0, ans=0.0 2023-06-23 23:08:18,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=969498.0, ans=0.035 2023-06-23 23:09:04,185 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.712e+02 2.470e+02 2.760e+02 3.335e+02 5.659e+02, threshold=5.519e+02, percent-clipped=0.0 2023-06-23 23:09:30,895 INFO [train.py:996] (2/4) Epoch 6, batch 9150, loss[loss=0.202, simple_loss=0.2878, pruned_loss=0.05815, over 21501.00 frames. 
], tot_loss[loss=0.2318, simple_loss=0.3119, pruned_loss=0.07588, over 4267804.10 frames. ], batch size: 131, lr: 5.19e-03, grad_scale: 16.0 2023-06-23 23:09:31,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=969738.0, ans=0.1 2023-06-23 23:10:09,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=969798.0, ans=0.2 2023-06-23 23:11:22,067 INFO [train.py:996] (2/4) Epoch 6, batch 9200, loss[loss=0.2789, simple_loss=0.35, pruned_loss=0.1039, over 21324.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3121, pruned_loss=0.07436, over 4261536.03 frames. ], batch size: 548, lr: 5.19e-03, grad_scale: 32.0 2023-06-23 23:11:58,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=970098.0, ans=0.1 2023-06-23 23:12:05,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=970098.0, ans=0.0 2023-06-23 23:12:27,222 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.76 vs. limit=15.0 2023-06-23 23:12:51,025 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.565e+02 2.927e+02 3.982e+02 7.343e+02, threshold=5.853e+02, percent-clipped=8.0 2023-06-23 23:13:17,990 INFO [train.py:996] (2/4) Epoch 6, batch 9250, loss[loss=0.2302, simple_loss=0.2975, pruned_loss=0.08146, over 21448.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3149, pruned_loss=0.07817, over 4268854.13 frames. ], batch size: 131, lr: 5.19e-03, grad_scale: 32.0 2023-06-23 23:13:48,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=970398.0, ans=0.0 2023-06-23 23:13:52,240 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 23:14:30,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=970518.0, ans=0.125 2023-06-23 23:15:15,154 INFO [train.py:996] (2/4) Epoch 6, batch 9300, loss[loss=0.261, simple_loss=0.3254, pruned_loss=0.09828, over 21281.00 frames. ], tot_loss[loss=0.232, simple_loss=0.308, pruned_loss=0.07802, over 4276833.24 frames. ], batch size: 471, lr: 5.19e-03, grad_scale: 32.0 2023-06-23 23:15:32,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=970698.0, ans=0.0 2023-06-23 23:15:34,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=970698.0, ans=15.0 2023-06-23 23:15:34,751 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.55 vs. limit=12.0 2023-06-23 23:16:33,209 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.705e+02 3.300e+02 3.579e+02 5.908e+02, threshold=6.601e+02, percent-clipped=1.0 2023-06-23 23:17:06,339 INFO [train.py:996] (2/4) Epoch 6, batch 9350, loss[loss=0.2209, simple_loss=0.2993, pruned_loss=0.07124, over 21900.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3144, pruned_loss=0.07867, over 4280631.17 frames. 
], batch size: 98, lr: 5.19e-03, grad_scale: 32.0 2023-06-23 23:17:17,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=970938.0, ans=0.0 2023-06-23 23:18:45,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=971178.0, ans=0.125 2023-06-23 23:18:57,432 INFO [train.py:996] (2/4) Epoch 6, batch 9400, loss[loss=0.1926, simple_loss=0.2672, pruned_loss=0.05901, over 21664.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3155, pruned_loss=0.07887, over 4282819.78 frames. ], batch size: 282, lr: 5.19e-03, grad_scale: 32.0 2023-06-23 23:19:00,021 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 23:19:33,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=971298.0, ans=0.125 2023-06-23 23:20:25,051 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 2.477e+02 2.813e+02 3.524e+02 8.030e+02, threshold=5.626e+02, percent-clipped=3.0 2023-06-23 23:20:46,126 INFO [train.py:996] (2/4) Epoch 6, batch 9450, loss[loss=0.2056, simple_loss=0.27, pruned_loss=0.07056, over 21601.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3075, pruned_loss=0.07797, over 4277004.96 frames. ], batch size: 298, lr: 5.19e-03, grad_scale: 16.0 2023-06-23 23:22:08,969 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 23:22:29,380 INFO [train.py:996] (2/4) Epoch 6, batch 9500, loss[loss=0.2144, simple_loss=0.2755, pruned_loss=0.07662, over 21537.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2987, pruned_loss=0.07574, over 4276975.73 frames. ], batch size: 414, lr: 5.19e-03, grad_scale: 16.0 2023-06-23 23:22:34,334 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-06-23 23:23:25,097 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=22.5 2023-06-23 23:23:26,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=971958.0, ans=0.1 2023-06-23 23:23:55,322 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.823e+02 2.481e+02 2.768e+02 3.385e+02 5.932e+02, threshold=5.537e+02, percent-clipped=1.0 2023-06-23 23:24:20,125 INFO [train.py:996] (2/4) Epoch 6, batch 9550, loss[loss=0.2652, simple_loss=0.3624, pruned_loss=0.08399, over 21638.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3032, pruned_loss=0.07771, over 4279302.29 frames. ], batch size: 414, lr: 5.19e-03, grad_scale: 16.0 2023-06-23 23:24:26,295 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.10 vs. 
limit=15.0 2023-06-23 23:24:58,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=972258.0, ans=0.0 2023-06-23 23:25:01,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=972258.0, ans=0.1 2023-06-23 23:25:01,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=972258.0, ans=0.1 2023-06-23 23:25:53,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=972378.0, ans=0.125 2023-06-23 23:26:02,077 INFO [train.py:996] (2/4) Epoch 6, batch 9600, loss[loss=0.2137, simple_loss=0.2896, pruned_loss=0.06892, over 21775.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3066, pruned_loss=0.0789, over 4286356.21 frames. ], batch size: 112, lr: 5.19e-03, grad_scale: 32.0 2023-06-23 23:26:12,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=972438.0, ans=0.04949747468305833 2023-06-23 23:26:22,503 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=15.0 2023-06-23 23:26:45,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=972558.0, ans=0.2 2023-06-23 23:27:14,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=972618.0, ans=0.125 2023-06-23 23:27:24,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=972618.0, ans=0.1 2023-06-23 23:27:29,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=972618.0, ans=0.0 2023-06-23 23:27:32,665 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.542e+02 2.834e+02 3.285e+02 4.885e+02, threshold=5.668e+02, percent-clipped=0.0 2023-06-23 23:27:37,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=972678.0, ans=0.1 2023-06-23 23:28:01,863 INFO [train.py:996] (2/4) Epoch 6, batch 9650, loss[loss=0.1983, simple_loss=0.2721, pruned_loss=0.06223, over 21491.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3089, pruned_loss=0.07885, over 4287235.78 frames. ], batch size: 211, lr: 5.19e-03, grad_scale: 16.0 2023-06-23 23:28:02,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=972738.0, ans=0.125 2023-06-23 23:28:12,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=972738.0, ans=0.1 2023-06-23 23:28:19,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=972798.0, ans=0.125 2023-06-23 23:28:39,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=972858.0, ans=0.125 2023-06-23 23:29:03,034 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.15 vs. 
limit=22.5 2023-06-23 23:29:50,942 INFO [train.py:996] (2/4) Epoch 6, batch 9700, loss[loss=0.2217, simple_loss=0.2978, pruned_loss=0.07275, over 21911.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3116, pruned_loss=0.07927, over 4287076.66 frames. ], batch size: 316, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:30:04,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=973038.0, ans=0.125 2023-06-23 23:30:22,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=973098.0, ans=0.125 2023-06-23 23:30:52,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=973218.0, ans=0.09899494936611666 2023-06-23 23:31:09,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=973278.0, ans=0.125 2023-06-23 23:31:10,723 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.422e+02 2.744e+02 3.326e+02 5.586e+02, threshold=5.488e+02, percent-clipped=0.0 2023-06-23 23:31:13,850 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.60 vs. limit=15.0 2023-06-23 23:31:38,370 INFO [train.py:996] (2/4) Epoch 6, batch 9750, loss[loss=0.2584, simple_loss=0.3451, pruned_loss=0.08588, over 21877.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3049, pruned_loss=0.0782, over 4286691.19 frames. ], batch size: 118, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:31:39,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=973338.0, ans=0.0 2023-06-23 23:32:05,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=973398.0, ans=0.125 2023-06-23 23:32:34,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=973458.0, ans=0.125 2023-06-23 23:33:02,179 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.77 vs. limit=15.0 2023-06-23 23:33:19,478 INFO [train.py:996] (2/4) Epoch 6, batch 9800, loss[loss=0.2331, simple_loss=0.2977, pruned_loss=0.08429, over 21786.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3061, pruned_loss=0.0783, over 4287619.45 frames. ], batch size: 441, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:33:31,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=973638.0, ans=0.2 2023-06-23 23:33:40,659 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.35 vs. 
limit=15.0 2023-06-23 23:33:46,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=973698.0, ans=0.125 2023-06-23 23:34:12,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=973758.0, ans=0.125 2023-06-23 23:34:37,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=973818.0, ans=0.0 2023-06-23 23:34:42,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=973818.0, ans=0.125 2023-06-23 23:34:45,288 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.591e+02 2.983e+02 3.638e+02 9.651e+02, threshold=5.966e+02, percent-clipped=4.0 2023-06-23 23:35:07,708 INFO [train.py:996] (2/4) Epoch 6, batch 9850, loss[loss=0.2107, simple_loss=0.2763, pruned_loss=0.0725, over 21442.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3021, pruned_loss=0.07776, over 4292847.95 frames. ], batch size: 131, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:35:38,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=973998.0, ans=0.0 2023-06-23 23:35:51,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=974058.0, ans=0.125 2023-06-23 23:35:52,442 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.48 vs. limit=22.5 2023-06-23 23:36:00,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=974058.0, ans=0.125 2023-06-23 23:36:19,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=974118.0, ans=0.125 2023-06-23 23:36:23,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=974118.0, ans=0.125 2023-06-23 23:36:35,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=974178.0, ans=0.0 2023-06-23 23:36:47,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=974178.0, ans=0.0 2023-06-23 23:36:57,009 INFO [train.py:996] (2/4) Epoch 6, batch 9900, loss[loss=0.2934, simple_loss=0.3572, pruned_loss=0.1148, over 21381.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2978, pruned_loss=0.07697, over 4282809.23 frames. ], batch size: 471, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:37:01,347 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 23:37:12,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=974238.0, ans=0.125 2023-06-23 23:38:23,770 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.567e+02 2.955e+02 3.451e+02 4.751e+02, threshold=5.911e+02, percent-clipped=0.0 2023-06-23 23:38:32,277 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.81 vs. 
limit=15.0 2023-06-23 23:38:46,968 INFO [train.py:996] (2/4) Epoch 6, batch 9950, loss[loss=0.2355, simple_loss=0.3086, pruned_loss=0.08121, over 21424.00 frames. ], tot_loss[loss=0.229, simple_loss=0.2992, pruned_loss=0.07942, over 4259726.64 frames. ], batch size: 131, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:38:53,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=974538.0, ans=0.0 2023-06-23 23:39:27,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=974598.0, ans=0.07 2023-06-23 23:39:41,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=974658.0, ans=0.125 2023-06-23 23:40:11,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=974718.0, ans=0.125 2023-06-23 23:40:11,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=974718.0, ans=0.0 2023-06-23 23:40:17,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=974778.0, ans=10.0 2023-06-23 23:40:42,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=974838.0, ans=0.0 2023-06-23 23:40:43,783 INFO [train.py:996] (2/4) Epoch 6, batch 10000, loss[loss=0.2159, simple_loss=0.2948, pruned_loss=0.06847, over 21762.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2949, pruned_loss=0.07786, over 4262430.51 frames. ], batch size: 352, lr: 5.18e-03, grad_scale: 32.0 2023-06-23 23:41:38,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=974958.0, ans=0.2 2023-06-23 23:42:01,840 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-23 23:42:10,939 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.477e+02 2.945e+02 3.555e+02 6.332e+02, threshold=5.891e+02, percent-clipped=1.0 2023-06-23 23:42:34,454 INFO [train.py:996] (2/4) Epoch 6, batch 10050, loss[loss=0.1933, simple_loss=0.2643, pruned_loss=0.06112, over 21717.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.297, pruned_loss=0.07826, over 4264590.86 frames. ], batch size: 282, lr: 5.18e-03, grad_scale: 32.0 2023-06-23 23:42:42,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=975138.0, ans=0.2 2023-06-23 23:43:19,222 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=15.0 2023-06-23 23:43:36,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=975258.0, ans=0.125 2023-06-23 23:44:16,276 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.65 vs. limit=10.0 2023-06-23 23:44:25,437 INFO [train.py:996] (2/4) Epoch 6, batch 10100, loss[loss=0.2285, simple_loss=0.2971, pruned_loss=0.07989, over 21609.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2945, pruned_loss=0.07647, over 4258583.48 frames. 
], batch size: 230, lr: 5.18e-03, grad_scale: 32.0 2023-06-23 23:44:42,916 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.97 vs. limit=15.0 2023-06-23 23:44:58,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=975498.0, ans=0.125 2023-06-23 23:45:56,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=975678.0, ans=0.2 2023-06-23 23:45:59,922 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.533e+02 2.969e+02 3.783e+02 6.881e+02, threshold=5.937e+02, percent-clipped=1.0 2023-06-23 23:46:02,916 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.03 vs. limit=15.0 2023-06-23 23:46:21,400 INFO [train.py:996] (2/4) Epoch 6, batch 10150, loss[loss=0.2267, simple_loss=0.2994, pruned_loss=0.07697, over 21655.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3019, pruned_loss=0.07911, over 4263474.51 frames. ], batch size: 332, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:46:25,926 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.57 vs. limit=22.5 2023-06-23 23:46:49,784 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-06-23 23:47:11,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=975858.0, ans=0.2 2023-06-23 23:48:09,608 INFO [train.py:996] (2/4) Epoch 6, batch 10200, loss[loss=0.2164, simple_loss=0.295, pruned_loss=0.06887, over 21001.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3002, pruned_loss=0.07647, over 4266831.00 frames. ], batch size: 607, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:48:10,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=976038.0, ans=0.125 2023-06-23 23:48:15,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=976038.0, ans=0.125 2023-06-23 23:48:27,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=976038.0, ans=0.125 2023-06-23 23:48:40,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=976098.0, ans=0.0 2023-06-23 23:49:33,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=976218.0, ans=0.04949747468305833 2023-06-23 23:49:38,142 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.688e+02 2.173e+02 2.583e+02 3.025e+02 4.269e+02, threshold=5.166e+02, percent-clipped=0.0 2023-06-23 23:49:59,510 INFO [train.py:996] (2/4) Epoch 6, batch 10250, loss[loss=0.1936, simple_loss=0.284, pruned_loss=0.05163, over 21378.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2956, pruned_loss=0.07134, over 4257419.94 frames. 
], batch size: 211, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:51:08,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=976518.0, ans=0.125 2023-06-23 23:51:41,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=976578.0, ans=0.2 2023-06-23 23:51:58,305 INFO [train.py:996] (2/4) Epoch 6, batch 10300, loss[loss=0.2477, simple_loss=0.3328, pruned_loss=0.08131, over 21426.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2996, pruned_loss=0.07301, over 4267384.18 frames. ], batch size: 194, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:52:11,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=976638.0, ans=0.125 2023-06-23 23:53:01,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=976758.0, ans=0.0 2023-06-23 23:53:28,851 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 2.521e+02 2.843e+02 3.478e+02 5.751e+02, threshold=5.686e+02, percent-clipped=3.0 2023-06-23 23:53:52,283 INFO [train.py:996] (2/4) Epoch 6, batch 10350, loss[loss=0.2148, simple_loss=0.3002, pruned_loss=0.0647, over 21838.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2995, pruned_loss=0.07284, over 4270947.99 frames. ], batch size: 372, lr: 5.17e-03, grad_scale: 16.0 2023-06-23 23:55:43,948 INFO [train.py:996] (2/4) Epoch 6, batch 10400, loss[loss=0.1596, simple_loss=0.209, pruned_loss=0.05507, over 21107.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2929, pruned_loss=0.07135, over 4271711.24 frames. ], batch size: 143, lr: 5.17e-03, grad_scale: 32.0 2023-06-23 23:55:59,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=977238.0, ans=0.0 2023-06-23 23:56:55,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=977418.0, ans=0.0 2023-06-23 23:57:20,294 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.786e+02 3.233e+02 3.708e+02 5.830e+02, threshold=6.465e+02, percent-clipped=3.0 2023-06-23 23:57:41,136 INFO [train.py:996] (2/4) Epoch 6, batch 10450, loss[loss=0.2394, simple_loss=0.3217, pruned_loss=0.07856, over 21825.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2974, pruned_loss=0.07405, over 4262513.52 frames. ], batch size: 316, lr: 5.17e-03, grad_scale: 32.0 2023-06-23 23:58:03,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=977598.0, ans=0.1 2023-06-23 23:58:30,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=977658.0, ans=0.09899494936611666 2023-06-23 23:58:50,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=977718.0, ans=0.0 2023-06-23 23:59:30,754 INFO [train.py:996] (2/4) Epoch 6, batch 10500, loss[loss=0.2232, simple_loss=0.2973, pruned_loss=0.07453, over 21811.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2962, pruned_loss=0.07284, over 4254879.75 frames. 
], batch size: 98, lr: 5.17e-03, grad_scale: 16.0 2023-06-24 00:00:08,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=977898.0, ans=0.0 2023-06-24 00:00:59,805 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.841e+02 2.398e+02 2.689e+02 3.123e+02 4.066e+02, threshold=5.379e+02, percent-clipped=0.0 2023-06-24 00:01:19,049 INFO [train.py:996] (2/4) Epoch 6, batch 10550, loss[loss=0.2075, simple_loss=0.2703, pruned_loss=0.07237, over 21662.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2921, pruned_loss=0.07289, over 4250192.67 frames. ], batch size: 333, lr: 5.17e-03, grad_scale: 16.0 2023-06-24 00:01:36,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=978138.0, ans=0.0 2023-06-24 00:02:32,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=978318.0, ans=0.125 2023-06-24 00:02:59,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=978378.0, ans=0.0 2023-06-24 00:03:09,242 INFO [train.py:996] (2/4) Epoch 6, batch 10600, loss[loss=0.197, simple_loss=0.2716, pruned_loss=0.06122, over 21737.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2872, pruned_loss=0.07111, over 4244613.28 frames. ], batch size: 316, lr: 5.17e-03, grad_scale: 16.0 2023-06-24 00:03:58,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=978558.0, ans=0.5 2023-06-24 00:04:19,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=978618.0, ans=0.0 2023-06-24 00:04:22,029 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0 2023-06-24 00:04:47,270 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.753e+02 2.546e+02 2.981e+02 3.597e+02 7.487e+02, threshold=5.961e+02, percent-clipped=2.0 2023-06-24 00:05:04,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=978678.0, ans=0.2 2023-06-24 00:05:12,644 INFO [train.py:996] (2/4) Epoch 6, batch 10650, loss[loss=0.181, simple_loss=0.2644, pruned_loss=0.04877, over 21804.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2894, pruned_loss=0.06956, over 4255971.64 frames. ], batch size: 317, lr: 5.17e-03, grad_scale: 16.0 2023-06-24 00:05:14,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=978738.0, ans=0.125 2023-06-24 00:05:32,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=978798.0, ans=0.0 2023-06-24 00:06:00,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=978858.0, ans=0.1 2023-06-24 00:06:02,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=978858.0, ans=0.125 2023-06-24 00:06:22,368 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.07 vs. 
limit=15.0 2023-06-24 00:06:33,388 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.12 vs. limit=15.0 2023-06-24 00:07:03,062 INFO [train.py:996] (2/4) Epoch 6, batch 10700, loss[loss=0.2722, simple_loss=0.3374, pruned_loss=0.1035, over 21698.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2887, pruned_loss=0.06949, over 4256547.43 frames. ], batch size: 441, lr: 5.17e-03, grad_scale: 16.0 2023-06-24 00:07:03,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=979038.0, ans=0.125 2023-06-24 00:07:12,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=979038.0, ans=0.2 2023-06-24 00:07:21,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=979098.0, ans=0.0 2023-06-24 00:07:50,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=979158.0, ans=0.95 2023-06-24 00:08:15,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=979218.0, ans=0.125 2023-06-24 00:08:33,479 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0 2023-06-24 00:08:35,823 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.562e+02 2.930e+02 3.343e+02 5.418e+02, threshold=5.860e+02, percent-clipped=0.0 2023-06-24 00:08:55,541 INFO [train.py:996] (2/4) Epoch 6, batch 10750, loss[loss=0.2522, simple_loss=0.3486, pruned_loss=0.07793, over 21407.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2999, pruned_loss=0.07442, over 4255852.92 frames. ], batch size: 211, lr: 5.17e-03, grad_scale: 16.0 2023-06-24 00:08:56,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=979338.0, ans=0.125 2023-06-24 00:08:56,916 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-06-24 00:09:50,786 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.80 vs. limit=22.5 2023-06-24 00:10:32,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=979578.0, ans=0.125 2023-06-24 00:10:34,602 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2023-06-24 00:10:47,833 INFO [train.py:996] (2/4) Epoch 6, batch 10800, loss[loss=0.2739, simple_loss=0.3916, pruned_loss=0.07811, over 19848.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3041, pruned_loss=0.07522, over 4252356.86 frames. 
], batch size: 702, lr: 5.17e-03, grad_scale: 32.0 2023-06-24 00:11:23,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=979698.0, ans=0.05 2023-06-24 00:12:24,838 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.761e+02 3.249e+02 3.882e+02 5.958e+02, threshold=6.498e+02, percent-clipped=1.0 2023-06-24 00:12:34,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=979878.0, ans=0.125 2023-06-24 00:12:44,106 INFO [train.py:996] (2/4) Epoch 6, batch 10850, loss[loss=0.188, simple_loss=0.2548, pruned_loss=0.0606, over 20700.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.307, pruned_loss=0.07613, over 4255394.81 frames. ], batch size: 608, lr: 5.17e-03, grad_scale: 32.0 2023-06-24 00:12:50,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=979938.0, ans=0.125 2023-06-24 00:12:53,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=979938.0, ans=0.0 2023-06-24 00:13:41,718 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.21 vs. limit=10.0 2023-06-24 00:13:53,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=980118.0, ans=0.1 2023-06-24 00:13:55,848 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-24 00:14:07,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=980118.0, ans=0.125 2023-06-24 00:14:09,917 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.28 vs. limit=22.5 2023-06-24 00:14:35,111 INFO [train.py:996] (2/4) Epoch 6, batch 10900, loss[loss=0.1982, simple_loss=0.2697, pruned_loss=0.06338, over 21212.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3, pruned_loss=0.07457, over 4253417.20 frames. ], batch size: 143, lr: 5.17e-03, grad_scale: 16.0 2023-06-24 00:15:40,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=980358.0, ans=0.2 2023-06-24 00:16:05,809 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.122e+02 2.411e+02 2.776e+02 2.994e+02 5.292e+02, threshold=5.553e+02, percent-clipped=0.0 2023-06-24 00:16:08,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=980478.0, ans=0.125 2023-06-24 00:16:22,916 INFO [train.py:996] (2/4) Epoch 6, batch 10950, loss[loss=0.1781, simple_loss=0.2497, pruned_loss=0.05318, over 21618.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2937, pruned_loss=0.07206, over 4247972.27 frames. ], batch size: 298, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:16:49,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=980598.0, ans=0.125 2023-06-24 00:18:13,161 INFO [train.py:996] (2/4) Epoch 6, batch 11000, loss[loss=0.2282, simple_loss=0.2914, pruned_loss=0.08247, over 21534.00 frames. 
], tot_loss[loss=0.2183, simple_loss=0.2923, pruned_loss=0.07214, over 4253687.28 frames. ], batch size: 195, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:19:26,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=981018.0, ans=0.0 2023-06-24 00:19:44,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=981078.0, ans=0.04949747468305833 2023-06-24 00:19:45,489 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.830e+02 2.423e+02 2.754e+02 3.301e+02 6.173e+02, threshold=5.508e+02, percent-clipped=2.0 2023-06-24 00:19:53,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=981078.0, ans=0.0 2023-06-24 00:19:58,274 INFO [train.py:996] (2/4) Epoch 6, batch 11050, loss[loss=0.2172, simple_loss=0.2746, pruned_loss=0.07994, over 21673.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2898, pruned_loss=0.07339, over 4260608.56 frames. ], batch size: 393, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:20:32,316 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.73 vs. limit=15.0 2023-06-24 00:20:49,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=981258.0, ans=0.2 2023-06-24 00:20:51,736 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.11 vs. limit=6.0 2023-06-24 00:21:12,378 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.13 vs. limit=10.0 2023-06-24 00:21:16,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=981318.0, ans=0.125 2023-06-24 00:21:45,968 INFO [train.py:996] (2/4) Epoch 6, batch 11100, loss[loss=0.2097, simple_loss=0.2782, pruned_loss=0.0706, over 21503.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2892, pruned_loss=0.07357, over 4251156.21 frames. ], batch size: 195, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:23:23,906 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.487e+02 2.801e+02 3.244e+02 5.802e+02, threshold=5.603e+02, percent-clipped=1.0 2023-06-24 00:23:36,117 INFO [train.py:996] (2/4) Epoch 6, batch 11150, loss[loss=0.2473, simple_loss=0.3342, pruned_loss=0.08017, over 20656.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2885, pruned_loss=0.07388, over 4258760.13 frames. ], batch size: 607, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:23:37,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=981738.0, ans=0.2 2023-06-24 00:24:13,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=981798.0, ans=0.0 2023-06-24 00:25:16,123 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.19 vs. 
limit=22.5 2023-06-24 00:25:24,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=981978.0, ans=0.0 2023-06-24 00:25:27,184 INFO [train.py:996] (2/4) Epoch 6, batch 11200, loss[loss=0.2136, simple_loss=0.2768, pruned_loss=0.07515, over 21541.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2878, pruned_loss=0.07345, over 4267268.16 frames. ], batch size: 414, lr: 5.16e-03, grad_scale: 32.0 2023-06-24 00:26:09,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=982098.0, ans=0.0 2023-06-24 00:26:12,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=982098.0, ans=0.1 2023-06-24 00:26:14,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=982098.0, ans=0.2 2023-06-24 00:26:44,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=982218.0, ans=0.0 2023-06-24 00:27:03,171 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.434e+02 2.676e+02 2.972e+02 5.122e+02, threshold=5.353e+02, percent-clipped=0.0 2023-06-24 00:27:15,151 INFO [train.py:996] (2/4) Epoch 6, batch 11250, loss[loss=0.2225, simple_loss=0.2894, pruned_loss=0.07782, over 20167.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2871, pruned_loss=0.07315, over 4266156.25 frames. ], batch size: 702, lr: 5.16e-03, grad_scale: 32.0 2023-06-24 00:27:45,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=982398.0, ans=0.125 2023-06-24 00:29:03,629 INFO [train.py:996] (2/4) Epoch 6, batch 11300, loss[loss=0.2076, simple_loss=0.2822, pruned_loss=0.06653, over 21918.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2892, pruned_loss=0.0736, over 4271016.10 frames. ], batch size: 316, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:29:54,712 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.76 vs. limit=6.0 2023-06-24 00:29:57,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=982758.0, ans=0.125 2023-06-24 00:30:02,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=982758.0, ans=0.125 2023-06-24 00:30:04,889 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-24 00:30:17,183 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.49 vs. limit=15.0 2023-06-24 00:30:25,920 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.07 vs. limit=15.0 2023-06-24 00:30:28,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=982818.0, ans=0.125 2023-06-24 00:30:36,114 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.52 vs. 
limit=15.0 2023-06-24 00:30:37,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=982878.0, ans=0.0 2023-06-24 00:30:43,981 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.722e+02 2.481e+02 2.716e+02 3.096e+02 3.979e+02, threshold=5.433e+02, percent-clipped=0.0 2023-06-24 00:31:01,016 INFO [train.py:996] (2/4) Epoch 6, batch 11350, loss[loss=0.1871, simple_loss=0.2694, pruned_loss=0.05241, over 21457.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2903, pruned_loss=0.07268, over 4268196.25 frames. ], batch size: 195, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:31:10,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=982938.0, ans=0.125 2023-06-24 00:31:31,784 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0 2023-06-24 00:32:24,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=983118.0, ans=0.125 2023-06-24 00:32:36,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=983178.0, ans=0.0 2023-06-24 00:32:44,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=983178.0, ans=0.1 2023-06-24 00:32:58,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=983238.0, ans=0.125 2023-06-24 00:32:59,515 INFO [train.py:996] (2/4) Epoch 6, batch 11400, loss[loss=0.2168, simple_loss=0.2948, pruned_loss=0.06938, over 21359.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2971, pruned_loss=0.07606, over 4269911.14 frames. ], batch size: 194, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:33:32,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=983298.0, ans=0.125 2023-06-24 00:34:20,197 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.12 vs. limit=12.0 2023-06-24 00:34:38,481 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.09 vs. limit=6.0 2023-06-24 00:34:38,929 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.559e+02 2.841e+02 3.332e+02 5.224e+02, threshold=5.682e+02, percent-clipped=0.0 2023-06-24 00:34:49,731 INFO [train.py:996] (2/4) Epoch 6, batch 11450, loss[loss=0.2228, simple_loss=0.3001, pruned_loss=0.07275, over 21585.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.298, pruned_loss=0.07464, over 4271486.24 frames. 
], batch size: 263, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:35:08,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=983538.0, ans=0.0 2023-06-24 00:35:20,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=983598.0, ans=0.125 2023-06-24 00:35:22,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=983598.0, ans=0.0 2023-06-24 00:36:46,013 INFO [train.py:996] (2/4) Epoch 6, batch 11500, loss[loss=0.2468, simple_loss=0.3297, pruned_loss=0.08197, over 21755.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3014, pruned_loss=0.0761, over 4276571.39 frames. ], batch size: 124, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:37:06,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=983898.0, ans=0.1 2023-06-24 00:37:39,487 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.19 vs. limit=15.0 2023-06-24 00:38:29,588 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.699e+02 3.055e+02 3.965e+02 5.631e+02, threshold=6.111e+02, percent-clipped=0.0 2023-06-24 00:38:41,229 INFO [train.py:996] (2/4) Epoch 6, batch 11550, loss[loss=0.3957, simple_loss=0.4765, pruned_loss=0.1575, over 21489.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.309, pruned_loss=0.0768, over 4276883.37 frames. ], batch size: 507, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:39:00,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=984138.0, ans=0.125 2023-06-24 00:39:27,662 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:40:26,924 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:40:38,832 INFO [train.py:996] (2/4) Epoch 6, batch 11600, loss[loss=0.2456, simple_loss=0.3396, pruned_loss=0.07575, over 21343.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3218, pruned_loss=0.07826, over 4271786.49 frames. ], batch size: 131, lr: 5.15e-03, grad_scale: 32.0 2023-06-24 00:40:54,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.out_whiten.whitening_limit, batch_count=984438.0, ans=8.0 2023-06-24 00:40:56,150 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.88 vs. 
limit=15.0 2023-06-24 00:41:09,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=984498.0, ans=0.1 2023-06-24 00:41:39,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=984558.0, ans=0.125 2023-06-24 00:41:43,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=984618.0, ans=0.1 2023-06-24 00:41:59,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=984618.0, ans=0.125 2023-06-24 00:42:15,285 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.062e+02 2.879e+02 3.402e+02 4.224e+02 8.565e+02, threshold=6.804e+02, percent-clipped=5.0 2023-06-24 00:42:28,808 INFO [train.py:996] (2/4) Epoch 6, batch 11650, loss[loss=0.2454, simple_loss=0.3271, pruned_loss=0.08186, over 21448.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3285, pruned_loss=0.07917, over 4270908.34 frames. ], batch size: 211, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:43:49,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=984978.0, ans=0.0 2023-06-24 00:44:08,399 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.96 vs. limit=15.0 2023-06-24 00:44:09,997 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=15.0 2023-06-24 00:44:12,084 INFO [train.py:996] (2/4) Epoch 6, batch 11700, loss[loss=0.2256, simple_loss=0.2814, pruned_loss=0.08491, over 21881.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3192, pruned_loss=0.07849, over 4271322.84 frames. ], batch size: 373, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:44:26,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=985038.0, ans=0.125 2023-06-24 00:44:37,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=985098.0, ans=0.0 2023-06-24 00:45:25,159 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=22.5 2023-06-24 00:45:52,492 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.525e+02 2.747e+02 3.370e+02 5.066e+02, threshold=5.494e+02, percent-clipped=0.0 2023-06-24 00:46:01,433 INFO [train.py:996] (2/4) Epoch 6, batch 11750, loss[loss=0.2807, simple_loss=0.3301, pruned_loss=0.1156, over 21367.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3106, pruned_loss=0.07748, over 4257301.91 frames. ], batch size: 471, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:47:52,531 INFO [train.py:996] (2/4) Epoch 6, batch 11800, loss[loss=0.2542, simple_loss=0.3289, pruned_loss=0.08971, over 21504.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.311, pruned_loss=0.07895, over 4258798.94 frames. ], batch size: 131, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:48:33,378 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.69 vs. 
limit=12.0 2023-06-24 00:49:31,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=985878.0, ans=0.125 2023-06-24 00:49:34,922 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 2.469e+02 2.710e+02 3.084e+02 4.949e+02, threshold=5.420e+02, percent-clipped=0.0 2023-06-24 00:49:43,742 INFO [train.py:996] (2/4) Epoch 6, batch 11850, loss[loss=0.2134, simple_loss=0.3117, pruned_loss=0.05753, over 21818.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3129, pruned_loss=0.07815, over 4259325.47 frames. ], batch size: 371, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:50:03,958 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.20 vs. limit=15.0 2023-06-24 00:50:49,791 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-24 00:51:12,041 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.83 vs. limit=6.0 2023-06-24 00:51:34,360 INFO [train.py:996] (2/4) Epoch 6, batch 11900, loss[loss=0.2446, simple_loss=0.3415, pruned_loss=0.07384, over 19726.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3146, pruned_loss=0.07609, over 4258106.51 frames. ], batch size: 702, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:51:51,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=986238.0, ans=0.025 2023-06-24 00:52:11,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=986298.0, ans=0.125 2023-06-24 00:52:27,899 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.50 vs. limit=15.0 2023-06-24 00:53:04,228 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.65 vs. limit=22.5 2023-06-24 00:53:14,258 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0 2023-06-24 00:53:16,518 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 2.327e+02 2.667e+02 3.121e+02 4.121e+02, threshold=5.333e+02, percent-clipped=0.0 2023-06-24 00:53:31,113 INFO [train.py:996] (2/4) Epoch 6, batch 11950, loss[loss=0.2317, simple_loss=0.3324, pruned_loss=0.06555, over 21668.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3125, pruned_loss=0.07252, over 4262468.76 frames. ], batch size: 247, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:54:18,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=986658.0, ans=0.1 2023-06-24 00:54:55,595 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:54:56,206 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-24 00:55:19,972 INFO [train.py:996] (2/4) Epoch 6, batch 12000, loss[loss=0.2083, simple_loss=0.2715, pruned_loss=0.07259, over 21570.00 frames. 
], tot_loss[loss=0.225, simple_loss=0.3078, pruned_loss=0.07113, over 4255638.53 frames. ], batch size: 263, lr: 5.15e-03, grad_scale: 32.0 2023-06-24 00:55:19,972 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 00:55:44,729 INFO [train.py:1028] (2/4) Epoch 6, validation: loss=0.2624, simple_loss=0.3526, pruned_loss=0.08607, over 1796401.00 frames. 2023-06-24 00:55:44,730 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-24 00:55:57,601 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:56:03,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=986898.0, ans=0.125 2023-06-24 00:56:05,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=986898.0, ans=0.125 2023-06-24 00:57:13,634 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.670e+02 2.572e+02 3.062e+02 3.583e+02 6.186e+02, threshold=6.124e+02, percent-clipped=1.0 2023-06-24 00:57:27,292 INFO [train.py:996] (2/4) Epoch 6, batch 12050, loss[loss=0.2252, simple_loss=0.2932, pruned_loss=0.07864, over 21646.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3045, pruned_loss=0.0733, over 4251723.74 frames. ], batch size: 195, lr: 5.15e-03, grad_scale: 32.0 2023-06-24 00:57:46,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=987138.0, ans=0.125 2023-06-24 00:58:34,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=987318.0, ans=0.0 2023-06-24 00:58:39,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=987318.0, ans=0.125 2023-06-24 00:59:24,251 INFO [train.py:996] (2/4) Epoch 6, batch 12100, loss[loss=0.2773, simple_loss=0.3595, pruned_loss=0.09754, over 21369.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3081, pruned_loss=0.07738, over 4258523.11 frames. ], batch size: 131, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:59:59,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=987498.0, ans=0.07 2023-06-24 01:00:00,444 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.18 vs. limit=12.0 2023-06-24 01:00:34,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=987618.0, ans=0.5 2023-06-24 01:01:06,905 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.880e+02 2.682e+02 3.113e+02 3.706e+02 5.999e+02, threshold=6.227e+02, percent-clipped=0.0 2023-06-24 01:01:14,020 INFO [train.py:996] (2/4) Epoch 6, batch 12150, loss[loss=0.2267, simple_loss=0.3168, pruned_loss=0.0683, over 21852.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3114, pruned_loss=0.07688, over 4254861.51 frames. 
], batch size: 316, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 01:01:14,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=987738.0, ans=0.2 2023-06-24 01:01:14,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=987738.0, ans=0.0 2023-06-24 01:01:15,515 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.39 vs. limit=15.0 2023-06-24 01:01:55,725 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:02:15,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=987858.0, ans=0.1 2023-06-24 01:02:17,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=987858.0, ans=0.125 2023-06-24 01:03:08,588 INFO [train.py:996] (2/4) Epoch 6, batch 12200, loss[loss=0.2652, simple_loss=0.3073, pruned_loss=0.1116, over 21353.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3074, pruned_loss=0.07665, over 4260300.67 frames. ], batch size: 508, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 01:03:35,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=988098.0, ans=0.2 2023-06-24 01:03:46,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=988158.0, ans=0.125 2023-06-24 01:03:53,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=988158.0, ans=0.125 2023-06-24 01:04:03,672 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.99 vs. limit=15.0 2023-06-24 01:04:45,488 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 2.375e+02 2.667e+02 3.386e+02 5.475e+02, threshold=5.334e+02, percent-clipped=0.0 2023-06-24 01:04:57,291 INFO [train.py:996] (2/4) Epoch 6, batch 12250, loss[loss=0.1552, simple_loss=0.2281, pruned_loss=0.04109, over 21757.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2992, pruned_loss=0.07357, over 4265622.86 frames. ], batch size: 112, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:05:12,333 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.30 vs. limit=15.0 2023-06-24 01:05:20,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=988398.0, ans=0.0 2023-06-24 01:06:33,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=988578.0, ans=0.125 2023-06-24 01:06:40,995 INFO [train.py:996] (2/4) Epoch 6, batch 12300, loss[loss=0.1708, simple_loss=0.2442, pruned_loss=0.04871, over 21153.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2913, pruned_loss=0.06778, over 4266571.24 frames. 
], batch size: 143, lr: 5.14e-03, grad_scale: 8.0 2023-06-24 01:07:49,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=988818.0, ans=15.0 2023-06-24 01:08:25,519 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 2.150e+02 2.660e+02 3.179e+02 5.593e+02, threshold=5.319e+02, percent-clipped=1.0 2023-06-24 01:08:36,026 INFO [train.py:996] (2/4) Epoch 6, batch 12350, loss[loss=0.2601, simple_loss=0.3377, pruned_loss=0.09121, over 21720.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2982, pruned_loss=0.06947, over 4275489.01 frames. ], batch size: 389, lr: 5.14e-03, grad_scale: 8.0 2023-06-24 01:08:48,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=988938.0, ans=10.0 2023-06-24 01:09:24,675 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=12.0 2023-06-24 01:09:49,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=989118.0, ans=0.1 2023-06-24 01:10:24,602 INFO [train.py:996] (2/4) Epoch 6, batch 12400, loss[loss=0.2111, simple_loss=0.2904, pruned_loss=0.06588, over 21832.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3002, pruned_loss=0.07301, over 4276374.97 frames. ], batch size: 298, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:10:35,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=989238.0, ans=0.1 2023-06-24 01:10:36,453 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-06-24 01:11:24,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=989358.0, ans=0.05 2023-06-24 01:11:39,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=989418.0, ans=0.125 2023-06-24 01:12:08,945 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.631e+02 2.949e+02 3.533e+02 4.721e+02, threshold=5.899e+02, percent-clipped=0.0 2023-06-24 01:12:14,227 INFO [train.py:996] (2/4) Epoch 6, batch 12450, loss[loss=0.264, simple_loss=0.3338, pruned_loss=0.0971, over 21605.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3031, pruned_loss=0.0761, over 4280735.64 frames. ], batch size: 389, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:13:34,397 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:13:46,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=989718.0, ans=0.125 2023-06-24 01:13:48,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=989778.0, ans=0.125 2023-06-24 01:14:08,587 INFO [train.py:996] (2/4) Epoch 6, batch 12500, loss[loss=0.2603, simple_loss=0.3584, pruned_loss=0.0811, over 21643.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3145, pruned_loss=0.07966, over 4279321.59 frames. 
], batch size: 263, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:14:30,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=989838.0, ans=0.1 2023-06-24 01:14:42,576 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.13 vs. limit=15.0 2023-06-24 01:14:53,032 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=15.0 2023-06-24 01:15:05,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=989958.0, ans=0.025 2023-06-24 01:15:25,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=990018.0, ans=0.0 2023-06-24 01:15:29,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=990018.0, ans=0.125 2023-06-24 01:15:39,317 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.58 vs. limit=12.0 2023-06-24 01:16:00,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=990078.0, ans=0.2 2023-06-24 01:16:01,959 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.301e+02 2.735e+02 3.011e+02 3.446e+02 4.823e+02, threshold=6.021e+02, percent-clipped=0.0 2023-06-24 01:16:02,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=990078.0, ans=0.0 2023-06-24 01:16:07,472 INFO [train.py:996] (2/4) Epoch 6, batch 12550, loss[loss=0.1832, simple_loss=0.2188, pruned_loss=0.07376, over 19991.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3172, pruned_loss=0.08142, over 4276479.30 frames. ], batch size: 703, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:16:45,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=990198.0, ans=0.0 2023-06-24 01:17:10,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=990318.0, ans=0.0 2023-06-24 01:18:03,117 INFO [train.py:996] (2/4) Epoch 6, batch 12600, loss[loss=0.2049, simple_loss=0.2911, pruned_loss=0.05939, over 21621.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3154, pruned_loss=0.0782, over 4274486.93 frames. ], batch size: 230, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:18:10,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=990438.0, ans=0.125 2023-06-24 01:18:55,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=990558.0, ans=0.1 2023-06-24 01:19:33,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=990678.0, ans=0.125 2023-06-24 01:19:46,579 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.342e+02 2.712e+02 3.358e+02 5.513e+02, threshold=5.424e+02, percent-clipped=0.0 2023-06-24 01:19:51,724 INFO [train.py:996] (2/4) Epoch 6, batch 12650, loss[loss=0.2099, simple_loss=0.2837, pruned_loss=0.06806, over 21883.00 frames. 
], tot_loss[loss=0.2275, simple_loss=0.3067, pruned_loss=0.0741, over 4275531.37 frames. ], batch size: 316, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:20:04,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=990738.0, ans=0.125 2023-06-24 01:20:16,098 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.30 vs. limit=12.0 2023-06-24 01:20:52,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=990918.0, ans=0.125 2023-06-24 01:20:53,299 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.93 vs. limit=15.0 2023-06-24 01:21:35,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=990978.0, ans=0.1 2023-06-24 01:21:40,703 INFO [train.py:996] (2/4) Epoch 6, batch 12700, loss[loss=0.2233, simple_loss=0.2939, pruned_loss=0.07638, over 21438.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3076, pruned_loss=0.07676, over 4278279.60 frames. ], batch size: 211, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:22:09,421 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.12 vs. limit=10.0 2023-06-24 01:22:13,365 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-24 01:22:19,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=991098.0, ans=0.0 2023-06-24 01:22:23,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=991158.0, ans=0.125 2023-06-24 01:22:23,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=991158.0, ans=0.1 2023-06-24 01:22:31,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=991158.0, ans=0.1 2023-06-24 01:22:40,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=991218.0, ans=0.125 2023-06-24 01:23:25,533 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.607e+02 2.938e+02 3.445e+02 5.217e+02, threshold=5.876e+02, percent-clipped=0.0 2023-06-24 01:23:31,061 INFO [train.py:996] (2/4) Epoch 6, batch 12750, loss[loss=0.2207, simple_loss=0.3001, pruned_loss=0.07061, over 20075.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3089, pruned_loss=0.0772, over 4274532.68 frames. 
], batch size: 702, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:23:53,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=991398.0, ans=0.1 2023-06-24 01:24:50,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=991518.0, ans=0.0 2023-06-24 01:25:07,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=991578.0, ans=0.2 2023-06-24 01:25:19,760 INFO [train.py:996] (2/4) Epoch 6, batch 12800, loss[loss=0.2945, simple_loss=0.3431, pruned_loss=0.123, over 21618.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3078, pruned_loss=0.07757, over 4276612.25 frames. ], batch size: 508, lr: 5.14e-03, grad_scale: 32.0 2023-06-24 01:26:27,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=991818.0, ans=0.2 2023-06-24 01:26:53,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=991878.0, ans=0.125 2023-06-24 01:27:06,618 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 2.498e+02 2.671e+02 3.042e+02 5.514e+02, threshold=5.341e+02, percent-clipped=0.0 2023-06-24 01:27:10,344 INFO [train.py:996] (2/4) Epoch 6, batch 12850, loss[loss=0.2362, simple_loss=0.3071, pruned_loss=0.08265, over 20674.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3103, pruned_loss=0.07869, over 4278044.41 frames. ], batch size: 607, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:27:12,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=991938.0, ans=0.0 2023-06-24 01:28:25,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=992118.0, ans=0.125 2023-06-24 01:28:41,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=992118.0, ans=0.1 2023-06-24 01:29:08,017 INFO [train.py:996] (2/4) Epoch 6, batch 12900, loss[loss=0.1949, simple_loss=0.2704, pruned_loss=0.05968, over 21184.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3086, pruned_loss=0.07512, over 4278754.87 frames. ], batch size: 159, lr: 5.13e-03, grad_scale: 16.0 2023-06-24 01:29:12,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=992238.0, ans=0.0 2023-06-24 01:29:23,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=992238.0, ans=0.5 2023-06-24 01:30:01,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=992358.0, ans=0.1 2023-06-24 01:30:10,308 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.47 vs. 
limit=22.5 2023-06-24 01:30:34,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=992478.0, ans=0.125 2023-06-24 01:30:55,032 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.691e+02 2.252e+02 2.502e+02 2.973e+02 5.465e+02, threshold=5.003e+02, percent-clipped=1.0 2023-06-24 01:30:58,567 INFO [train.py:996] (2/4) Epoch 6, batch 12950, loss[loss=0.1979, simple_loss=0.28, pruned_loss=0.05788, over 21696.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3067, pruned_loss=0.0739, over 4278331.54 frames. ], batch size: 332, lr: 5.13e-03, grad_scale: 16.0 2023-06-24 01:31:40,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=992598.0, ans=0.125 2023-06-24 01:32:27,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=992718.0, ans=0.0 2023-06-24 01:32:27,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=992718.0, ans=0.07 2023-06-24 01:32:47,512 INFO [train.py:996] (2/4) Epoch 6, batch 13000, loss[loss=0.1929, simple_loss=0.2783, pruned_loss=0.05372, over 21828.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3072, pruned_loss=0.07471, over 4282793.46 frames. ], batch size: 372, lr: 5.13e-03, grad_scale: 16.0 2023-06-24 01:32:55,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=992838.0, ans=0.2 2023-06-24 01:33:22,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=992898.0, ans=0.125 2023-06-24 01:33:24,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=992898.0, ans=0.05 2023-06-24 01:33:26,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=992898.0, ans=0.125 2023-06-24 01:34:15,505 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.56 vs. limit=15.0 2023-06-24 01:34:22,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=993078.0, ans=0.1 2023-06-24 01:34:33,613 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.751e+02 2.511e+02 2.962e+02 3.599e+02 5.386e+02, threshold=5.923e+02, percent-clipped=1.0 2023-06-24 01:34:36,952 INFO [train.py:996] (2/4) Epoch 6, batch 13050, loss[loss=0.2462, simple_loss=0.3138, pruned_loss=0.08923, over 21932.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3045, pruned_loss=0.07337, over 4287774.86 frames. ], batch size: 415, lr: 5.13e-03, grad_scale: 16.0 2023-06-24 01:34:57,590 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.78 vs. limit=12.0 2023-06-24 01:35:21,430 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-06-24 01:36:21,548 INFO [train.py:996] (2/4) Epoch 6, batch 13100, loss[loss=0.2755, simple_loss=0.421, pruned_loss=0.06504, over 19634.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3058, pruned_loss=0.07332, over 4288348.82 frames. 
], batch size: 702, lr: 5.13e-03, grad_scale: 16.0 2023-06-24 01:36:31,897 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.59 vs. limit=10.0 2023-06-24 01:36:37,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=993438.0, ans=0.0 2023-06-24 01:36:50,354 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.30 vs. limit=10.0 2023-06-24 01:36:56,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=993498.0, ans=0.125 2023-06-24 01:36:59,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=993498.0, ans=0.0 2023-06-24 01:37:59,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=993678.0, ans=0.0 2023-06-24 01:38:09,184 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 2.775e+02 3.249e+02 4.198e+02 6.182e+02, threshold=6.497e+02, percent-clipped=2.0 2023-06-24 01:38:12,848 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.89 vs. limit=15.0 2023-06-24 01:38:18,938 INFO [train.py:996] (2/4) Epoch 6, batch 13150, loss[loss=0.2019, simple_loss=0.2612, pruned_loss=0.07126, over 21188.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3093, pruned_loss=0.07569, over 4281809.53 frames. ], batch size: 143, lr: 5.13e-03, grad_scale: 16.0 2023-06-24 01:38:44,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=993798.0, ans=0.0 2023-06-24 01:38:47,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=993798.0, ans=0.0 2023-06-24 01:39:11,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=993858.0, ans=0.0 2023-06-24 01:39:12,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=993858.0, ans=0.0 2023-06-24 01:39:28,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=993918.0, ans=0.1 2023-06-24 01:40:09,797 INFO [train.py:996] (2/4) Epoch 6, batch 13200, loss[loss=0.2285, simple_loss=0.3011, pruned_loss=0.07795, over 21400.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3078, pruned_loss=0.07535, over 4280037.36 frames. ], batch size: 549, lr: 5.13e-03, grad_scale: 32.0 2023-06-24 01:40:17,602 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.23 vs. 
limit=22.5 2023-06-24 01:40:56,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=994158.0, ans=0.04949747468305833 2023-06-24 01:41:32,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=994278.0, ans=0.1 2023-06-24 01:41:56,120 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.076e+02 2.674e+02 2.987e+02 3.685e+02 5.841e+02, threshold=5.974e+02, percent-clipped=0.0 2023-06-24 01:41:59,684 INFO [train.py:996] (2/4) Epoch 6, batch 13250, loss[loss=0.2286, simple_loss=0.3257, pruned_loss=0.06576, over 21800.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3071, pruned_loss=0.07688, over 4282072.68 frames. ], batch size: 332, lr: 5.13e-03, grad_scale: 32.0 2023-06-24 01:42:35,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=994398.0, ans=0.0 2023-06-24 01:43:46,111 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.97 vs. limit=15.0 2023-06-24 01:43:49,657 INFO [train.py:996] (2/4) Epoch 6, batch 13300, loss[loss=0.2745, simple_loss=0.3415, pruned_loss=0.1038, over 21790.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3099, pruned_loss=0.07657, over 4281908.87 frames. ], batch size: 124, lr: 5.13e-03, grad_scale: 32.0 2023-06-24 01:44:20,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=994698.0, ans=0.1 2023-06-24 01:44:56,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=994758.0, ans=15.0 2023-06-24 01:45:18,315 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.82 vs. limit=15.0 2023-06-24 01:45:41,756 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.520e+02 2.865e+02 3.222e+02 4.480e+02, threshold=5.730e+02, percent-clipped=0.0 2023-06-24 01:45:41,800 INFO [train.py:996] (2/4) Epoch 6, batch 13350, loss[loss=0.2414, simple_loss=0.3256, pruned_loss=0.0786, over 21724.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3134, pruned_loss=0.07883, over 4277081.01 frames. ], batch size: 351, lr: 5.13e-03, grad_scale: 8.0 2023-06-24 01:46:03,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=994998.0, ans=0.0 2023-06-24 01:46:54,196 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:46:55,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=995118.0, ans=0.125 2023-06-24 01:47:03,706 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.31 vs. limit=6.0 2023-06-24 01:47:32,602 INFO [train.py:996] (2/4) Epoch 6, batch 13400, loss[loss=0.2433, simple_loss=0.3087, pruned_loss=0.08901, over 21428.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3146, pruned_loss=0.08071, over 4277163.62 frames. 
], batch size: 548, lr: 5.13e-03, grad_scale: 8.0 2023-06-24 01:47:35,568 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.13 vs. limit=15.0 2023-06-24 01:47:40,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=995238.0, ans=0.125 2023-06-24 01:48:02,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=995298.0, ans=0.125 2023-06-24 01:48:05,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=995298.0, ans=0.2 2023-06-24 01:48:26,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=995358.0, ans=0.2 2023-06-24 01:48:44,917 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.08 vs. limit=15.0 2023-06-24 01:49:23,336 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.257e+02 2.783e+02 3.072e+02 3.557e+02 5.639e+02, threshold=6.143e+02, percent-clipped=0.0 2023-06-24 01:49:23,381 INFO [train.py:996] (2/4) Epoch 6, batch 13450, loss[loss=0.2514, simple_loss=0.3193, pruned_loss=0.09178, over 21739.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3162, pruned_loss=0.0822, over 4272745.01 frames. ], batch size: 441, lr: 5.13e-03, grad_scale: 8.0 2023-06-24 01:50:07,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=995598.0, ans=0.04949747468305833 2023-06-24 01:51:13,918 INFO [train.py:996] (2/4) Epoch 6, batch 13500, loss[loss=0.2082, simple_loss=0.2812, pruned_loss=0.06758, over 21699.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3063, pruned_loss=0.07945, over 4261586.89 frames. ], batch size: 298, lr: 5.13e-03, grad_scale: 8.0 2023-06-24 01:53:06,796 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.929e+02 2.607e+02 3.013e+02 3.630e+02 7.011e+02, threshold=6.026e+02, percent-clipped=1.0 2023-06-24 01:53:06,846 INFO [train.py:996] (2/4) Epoch 6, batch 13550, loss[loss=0.244, simple_loss=0.3505, pruned_loss=0.06871, over 21799.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3109, pruned_loss=0.07863, over 4261022.07 frames. 
], batch size: 282, lr: 5.12e-03, grad_scale: 8.0 2023-06-24 01:53:59,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=996258.0, ans=0.125 2023-06-24 01:54:06,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=996258.0, ans=0.125 2023-06-24 01:54:22,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=996318.0, ans=0.0 2023-06-24 01:54:23,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=996318.0, ans=0.125 2023-06-24 01:54:35,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=996378.0, ans=0.125 2023-06-24 01:54:47,715 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.17 vs. limit=6.0 2023-06-24 01:54:57,335 INFO [train.py:996] (2/4) Epoch 6, batch 13600, loss[loss=0.2187, simple_loss=0.302, pruned_loss=0.0677, over 21527.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3124, pruned_loss=0.07948, over 4259551.41 frames. ], batch size: 131, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 01:54:59,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=996438.0, ans=0.07 2023-06-24 01:55:09,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=996438.0, ans=0.2 2023-06-24 01:55:48,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=996558.0, ans=0.2 2023-06-24 01:56:35,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=996678.0, ans=0.05 2023-06-24 01:56:47,196 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 2.489e+02 2.780e+02 3.135e+02 6.333e+02, threshold=5.560e+02, percent-clipped=1.0 2023-06-24 01:56:47,232 INFO [train.py:996] (2/4) Epoch 6, batch 13650, loss[loss=0.2119, simple_loss=0.2659, pruned_loss=0.07897, over 20067.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3072, pruned_loss=0.07608, over 4256451.54 frames. ], batch size: 703, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 01:57:06,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=996738.0, ans=0.1 2023-06-24 01:57:21,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=996798.0, ans=0.0 2023-06-24 01:57:39,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=996858.0, ans=0.0 2023-06-24 01:58:21,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=996978.0, ans=0.0 2023-06-24 01:58:37,415 INFO [train.py:996] (2/4) Epoch 6, batch 13700, loss[loss=0.1908, simple_loss=0.2545, pruned_loss=0.06355, over 21255.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3009, pruned_loss=0.07573, over 4259740.12 frames. 
], batch size: 176, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 01:58:42,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=997038.0, ans=0.125 2023-06-24 01:59:24,727 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.51 vs. limit=22.5 2023-06-24 01:59:36,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=997158.0, ans=0.125 2023-06-24 02:00:05,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=997218.0, ans=0.0 2023-06-24 02:00:41,408 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.702e+02 3.112e+02 3.506e+02 5.710e+02, threshold=6.223e+02, percent-clipped=1.0 2023-06-24 02:00:41,442 INFO [train.py:996] (2/4) Epoch 6, batch 13750, loss[loss=0.2414, simple_loss=0.3202, pruned_loss=0.08133, over 21628.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2986, pruned_loss=0.07479, over 4254578.65 frames. ], batch size: 414, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:00:55,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=997338.0, ans=0.125 2023-06-24 02:01:00,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=997398.0, ans=0.0 2023-06-24 02:01:03,557 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-06-24 02:02:22,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=997578.0, ans=0.125 2023-06-24 02:02:30,773 INFO [train.py:996] (2/4) Epoch 6, batch 13800, loss[loss=0.2811, simple_loss=0.3824, pruned_loss=0.08994, over 21670.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3043, pruned_loss=0.07428, over 4261860.24 frames. ], batch size: 414, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:02:49,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=997638.0, ans=0.125 2023-06-24 02:04:00,355 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:04:02,901 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.47 vs. limit=10.0 2023-06-24 02:04:05,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=997878.0, ans=0.0 2023-06-24 02:04:22,875 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.067e+02 2.948e+02 3.505e+02 4.086e+02 7.226e+02, threshold=7.009e+02, percent-clipped=3.0 2023-06-24 02:04:22,916 INFO [train.py:996] (2/4) Epoch 6, batch 13850, loss[loss=0.2998, simple_loss=0.3848, pruned_loss=0.1074, over 21706.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3105, pruned_loss=0.07551, over 4262768.52 frames. 
], batch size: 441, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:04:25,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=997938.0, ans=0.125 2023-06-24 02:04:25,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=997938.0, ans=15.0 2023-06-24 02:04:31,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=997938.0, ans=0.0 2023-06-24 02:04:40,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=997938.0, ans=0.125 2023-06-24 02:06:17,511 INFO [train.py:996] (2/4) Epoch 6, batch 13900, loss[loss=0.2573, simple_loss=0.3595, pruned_loss=0.07749, over 20733.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3127, pruned_loss=0.07794, over 4269247.51 frames. ], batch size: 608, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:06:22,837 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.08 vs. limit=15.0 2023-06-24 02:07:55,790 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-24 02:08:08,073 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=22.5 2023-06-24 02:08:08,422 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.809e+02 3.184e+02 3.702e+02 5.147e+02, threshold=6.368e+02, percent-clipped=0.0 2023-06-24 02:08:08,454 INFO [train.py:996] (2/4) Epoch 6, batch 13950, loss[loss=0.2966, simple_loss=0.354, pruned_loss=0.1196, over 21619.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3139, pruned_loss=0.08026, over 4276531.21 frames. ], batch size: 471, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:08:45,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=998598.0, ans=0.0 2023-06-24 02:08:56,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=998658.0, ans=0.1 2023-06-24 02:09:24,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=998718.0, ans=0.0 2023-06-24 02:09:28,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=998718.0, ans=0.5 2023-06-24 02:09:31,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=998718.0, ans=0.125 2023-06-24 02:09:57,234 INFO [train.py:996] (2/4) Epoch 6, batch 14000, loss[loss=0.1838, simple_loss=0.259, pruned_loss=0.05431, over 21748.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3113, pruned_loss=0.07864, over 4278321.68 frames. ], batch size: 248, lr: 5.12e-03, grad_scale: 32.0 2023-06-24 02:11:41,596 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:11:46,196 INFO [train.py:996] (2/4) Epoch 6, batch 14050, loss[loss=0.1896, simple_loss=0.2603, pruned_loss=0.05942, over 21426.00 frames. 
], tot_loss[loss=0.2285, simple_loss=0.3072, pruned_loss=0.07491, over 4279045.29 frames. ], batch size: 211, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:11:47,706 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.602e+02 2.313e+02 2.760e+02 3.193e+02 4.998e+02, threshold=5.521e+02, percent-clipped=0.0 2023-06-24 02:11:50,675 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=15.0 2023-06-24 02:12:09,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=999198.0, ans=0.0 2023-06-24 02:12:13,765 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=22.5 2023-06-24 02:12:50,973 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.01 vs. limit=6.0 2023-06-24 02:13:28,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=999378.0, ans=0.0 2023-06-24 02:13:35,289 INFO [train.py:996] (2/4) Epoch 6, batch 14100, loss[loss=0.2452, simple_loss=0.3157, pruned_loss=0.08736, over 21922.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3007, pruned_loss=0.07462, over 4272386.89 frames. ], batch size: 317, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:13:35,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=999438.0, ans=0.125 2023-06-24 02:14:22,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=999558.0, ans=0.125 2023-06-24 02:15:15,654 INFO [train.py:996] (2/4) Epoch 6, batch 14150, loss[loss=0.245, simple_loss=0.3343, pruned_loss=0.0778, over 21648.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3037, pruned_loss=0.07536, over 4262594.36 frames. ], batch size: 414, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:15:17,254 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.422e+02 2.767e+02 3.253e+02 5.449e+02, threshold=5.534e+02, percent-clipped=0.0 2023-06-24 02:15:41,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=999798.0, ans=10.0 2023-06-24 02:16:13,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=999858.0, ans=0.0 2023-06-24 02:16:42,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=999918.0, ans=0.2 2023-06-24 02:16:59,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=999978.0, ans=0.1 2023-06-24 02:17:02,029 INFO [train.py:996] (2/4) Epoch 6, batch 14200, loss[loss=0.2384, simple_loss=0.3041, pruned_loss=0.08631, over 21113.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3017, pruned_loss=0.07407, over 4255071.19 frames. ], batch size: 608, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:18:52,104 INFO [train.py:996] (2/4) Epoch 6, batch 14250, loss[loss=0.1939, simple_loss=0.2641, pruned_loss=0.06187, over 21798.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2971, pruned_loss=0.07367, over 4254954.85 frames. 
], batch size: 124, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:18:53,607 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.840e+02 2.255e+02 2.600e+02 3.105e+02 6.584e+02, threshold=5.199e+02, percent-clipped=1.0 2023-06-24 02:20:14,329 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=22.5 2023-06-24 02:20:44,862 INFO [train.py:996] (2/4) Epoch 6, batch 14300, loss[loss=0.2613, simple_loss=0.3514, pruned_loss=0.08565, over 21810.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2966, pruned_loss=0.07187, over 4245611.67 frames. ], batch size: 282, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:20:59,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1000638.0, ans=0.1 2023-06-24 02:21:06,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1000698.0, ans=0.125 2023-06-24 02:21:35,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1000758.0, ans=15.0 2023-06-24 02:21:58,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1000818.0, ans=0.125 2023-06-24 02:22:10,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1000818.0, ans=0.125 2023-06-24 02:22:34,173 INFO [train.py:996] (2/4) Epoch 6, batch 14350, loss[loss=0.2154, simple_loss=0.2917, pruned_loss=0.06951, over 21540.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3013, pruned_loss=0.07256, over 4251251.16 frames. ], batch size: 195, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:22:36,005 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.695e+02 2.573e+02 3.287e+02 4.161e+02 6.824e+02, threshold=6.573e+02, percent-clipped=7.0 2023-06-24 02:22:47,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1000938.0, ans=0.025 2023-06-24 02:22:50,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1000938.0, ans=0.1 2023-06-24 02:23:40,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1001058.0, ans=0.125 2023-06-24 02:24:00,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1001118.0, ans=0.2 2023-06-24 02:24:14,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1001178.0, ans=0.125 2023-06-24 02:24:25,762 INFO [train.py:996] (2/4) Epoch 6, batch 14400, loss[loss=0.2042, simple_loss=0.2693, pruned_loss=0.06956, over 21204.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3003, pruned_loss=0.07313, over 4260204.25 frames. ], batch size: 159, lr: 5.11e-03, grad_scale: 32.0 2023-06-24 02:25:03,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1001298.0, ans=0.125 2023-06-24 02:25:03,728 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.90 vs. 
limit=15.0 2023-06-24 02:25:55,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1001478.0, ans=0.2 2023-06-24 02:26:08,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1001538.0, ans=0.0 2023-06-24 02:26:09,518 INFO [train.py:996] (2/4) Epoch 6, batch 14450, loss[loss=0.2155, simple_loss=0.284, pruned_loss=0.07346, over 21784.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2941, pruned_loss=0.0736, over 4270921.54 frames. ], batch size: 112, lr: 5.11e-03, grad_scale: 32.0 2023-06-24 02:26:16,060 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 2.443e+02 2.785e+02 3.113e+02 5.962e+02, threshold=5.570e+02, percent-clipped=0.0 2023-06-24 02:26:43,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1001598.0, ans=0.125 2023-06-24 02:26:52,694 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=22.5 2023-06-24 02:27:56,673 INFO [train.py:996] (2/4) Epoch 6, batch 14500, loss[loss=0.1949, simple_loss=0.2714, pruned_loss=0.05924, over 21850.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2896, pruned_loss=0.07241, over 4273946.39 frames. ], batch size: 118, lr: 5.11e-03, grad_scale: 32.0 2023-06-24 02:28:16,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1001838.0, ans=0.0 2023-06-24 02:28:32,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1001898.0, ans=0.2 2023-06-24 02:28:39,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1001898.0, ans=0.035 2023-06-24 02:29:17,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1002018.0, ans=0.0 2023-06-24 02:29:36,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1002078.0, ans=0.125 2023-06-24 02:29:44,597 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-24 02:29:52,413 INFO [train.py:996] (2/4) Epoch 6, batch 14550, loss[loss=0.2431, simple_loss=0.3207, pruned_loss=0.08278, over 21901.00 frames. ], tot_loss[loss=0.222, simple_loss=0.295, pruned_loss=0.07454, over 4276618.45 frames. 
], batch size: 316, lr: 5.11e-03, grad_scale: 32.0 2023-06-24 02:30:01,987 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.448e+02 2.869e+02 3.616e+02 7.079e+02, threshold=5.738e+02, percent-clipped=4.0 2023-06-24 02:31:02,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1002318.0, ans=0.125 2023-06-24 02:31:20,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1002378.0, ans=0.09899494936611666 2023-06-24 02:31:43,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1002378.0, ans=0.125 2023-06-24 02:31:46,403 INFO [train.py:996] (2/4) Epoch 6, batch 14600, loss[loss=0.2096, simple_loss=0.2542, pruned_loss=0.08246, over 20265.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3044, pruned_loss=0.07918, over 4278007.87 frames. ], batch size: 702, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:32:00,768 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:32:04,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1002498.0, ans=0.1 2023-06-24 02:33:01,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1002678.0, ans=0.125 2023-06-24 02:33:22,849 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.53 vs. limit=15.0 2023-06-24 02:33:25,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1002678.0, ans=0.125 2023-06-24 02:33:25,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1002678.0, ans=0.0 2023-06-24 02:33:28,153 INFO [train.py:996] (2/4) Epoch 6, batch 14650, loss[loss=0.299, simple_loss=0.3826, pruned_loss=0.1076, over 21511.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3072, pruned_loss=0.07869, over 4282315.28 frames. ], batch size: 471, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:33:31,375 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.829e+02 2.911e+02 3.568e+02 4.716e+02 7.092e+02, threshold=7.135e+02, percent-clipped=11.0 2023-06-24 02:33:39,513 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.05 vs. 
limit=15.0 2023-06-24 02:33:40,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1002738.0, ans=0.125 2023-06-24 02:33:53,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1002798.0, ans=0.125 2023-06-24 02:33:54,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1002798.0, ans=0.125 2023-06-24 02:33:55,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1002798.0, ans=0.125 2023-06-24 02:34:43,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1002918.0, ans=0.125 2023-06-24 02:35:15,492 INFO [train.py:996] (2/4) Epoch 6, batch 14700, loss[loss=0.1975, simple_loss=0.2893, pruned_loss=0.0528, over 21608.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3002, pruned_loss=0.07312, over 4282598.95 frames. ], batch size: 230, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:35:16,885 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.04 vs. limit=22.5 2023-06-24 02:35:56,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1003098.0, ans=0.2 2023-06-24 02:37:05,514 INFO [train.py:996] (2/4) Epoch 6, batch 14750, loss[loss=0.3428, simple_loss=0.4159, pruned_loss=0.1348, over 21615.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3065, pruned_loss=0.07608, over 4281685.80 frames. ], batch size: 414, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:37:08,874 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.655e+02 2.584e+02 3.183e+02 3.769e+02 5.952e+02, threshold=6.365e+02, percent-clipped=0.0 2023-06-24 02:37:09,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1003338.0, ans=0.0 2023-06-24 02:37:50,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1003458.0, ans=0.0 2023-06-24 02:37:51,513 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.10 vs. limit=15.0 2023-06-24 02:37:57,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1003458.0, ans=0.2 2023-06-24 02:38:25,326 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:38:59,546 INFO [train.py:996] (2/4) Epoch 6, batch 14800, loss[loss=0.309, simple_loss=0.383, pruned_loss=0.1176, over 21565.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3173, pruned_loss=0.08067, over 4280540.79 frames. 
], batch size: 389, lr: 5.11e-03, grad_scale: 32.0 2023-06-24 02:39:05,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1003638.0, ans=0.125 2023-06-24 02:39:30,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1003698.0, ans=0.0 2023-06-24 02:39:33,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1003698.0, ans=0.125 2023-06-24 02:40:00,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1003758.0, ans=0.125 2023-06-24 02:40:27,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1003878.0, ans=0.125 2023-06-24 02:40:27,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1003878.0, ans=0.125 2023-06-24 02:40:40,838 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-24 02:40:55,587 INFO [train.py:996] (2/4) Epoch 6, batch 14850, loss[loss=0.2087, simple_loss=0.2655, pruned_loss=0.07596, over 21334.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3101, pruned_loss=0.07933, over 4281929.70 frames. ], batch size: 160, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:40:59,018 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.106e+02 2.678e+02 3.116e+02 4.005e+02 6.901e+02, threshold=6.233e+02, percent-clipped=1.0 2023-06-24 02:41:23,738 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:41:31,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.whiten.whitening_limit, batch_count=1003998.0, ans=15.0 2023-06-24 02:42:13,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1004118.0, ans=0.125 2023-06-24 02:42:35,265 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:42:40,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1004178.0, ans=0.125 2023-06-24 02:42:42,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1004178.0, ans=0.125 2023-06-24 02:42:47,055 INFO [train.py:996] (2/4) Epoch 6, batch 14900, loss[loss=0.2504, simple_loss=0.3292, pruned_loss=0.0858, over 21831.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3148, pruned_loss=0.08124, over 4280421.65 frames. ], batch size: 124, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:42:49,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1004238.0, ans=0.04949747468305833 2023-06-24 02:44:00,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1004418.0, ans=0.0 2023-06-24 02:44:36,528 INFO [train.py:996] (2/4) Epoch 6, batch 14950, loss[loss=0.2003, simple_loss=0.2875, pruned_loss=0.05651, over 21701.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3149, pruned_loss=0.08058, over 4275560.78 frames. 
], batch size: 247, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:44:39,949 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.887e+02 2.635e+02 3.010e+02 3.574e+02 5.643e+02, threshold=6.019e+02, percent-clipped=0.0 2023-06-24 02:45:08,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1004598.0, ans=0.125 2023-06-24 02:46:09,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1004778.0, ans=0.0 2023-06-24 02:46:24,992 INFO [train.py:996] (2/4) Epoch 6, batch 15000, loss[loss=0.2564, simple_loss=0.3293, pruned_loss=0.0918, over 21621.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3171, pruned_loss=0.08207, over 4283640.27 frames. ], batch size: 389, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:46:24,993 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 02:46:45,313 INFO [train.py:1028] (2/4) Epoch 6, validation: loss=0.2621, simple_loss=0.3511, pruned_loss=0.08652, over 1796401.00 frames. 2023-06-24 02:46:45,314 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-24 02:47:26,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1004898.0, ans=0.125 2023-06-24 02:48:12,121 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:48:36,408 INFO [train.py:996] (2/4) Epoch 6, batch 15050, loss[loss=0.2133, simple_loss=0.2847, pruned_loss=0.07092, over 21288.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3179, pruned_loss=0.08346, over 4283251.51 frames. ], batch size: 176, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:48:45,231 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 2.748e+02 3.194e+02 3.808e+02 5.890e+02, threshold=6.387e+02, percent-clipped=0.0 2023-06-24 02:49:23,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1005198.0, ans=0.0 2023-06-24 02:49:27,427 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.59 vs. limit=15.0 2023-06-24 02:50:31,383 INFO [train.py:996] (2/4) Epoch 6, batch 15100, loss[loss=0.2418, simple_loss=0.3125, pruned_loss=0.0856, over 20619.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3198, pruned_loss=0.08289, over 4273394.41 frames. ], batch size: 607, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:50:48,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1005438.0, ans=0.125 2023-06-24 02:51:21,883 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.60 vs. limit=22.5 2023-06-24 02:52:10,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1005678.0, ans=0.0 2023-06-24 02:52:20,518 INFO [train.py:996] (2/4) Epoch 6, batch 15150, loss[loss=0.2054, simple_loss=0.2645, pruned_loss=0.07317, over 15445.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3149, pruned_loss=0.08218, over 4269766.70 frames. 
], batch size: 60, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:52:29,946 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 2.489e+02 2.718e+02 3.127e+02 6.231e+02, threshold=5.435e+02, percent-clipped=0.0 2023-06-24 02:52:37,839 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-24 02:52:46,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1005738.0, ans=0.2 2023-06-24 02:53:53,295 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-06-24 02:54:14,599 INFO [train.py:996] (2/4) Epoch 6, batch 15200, loss[loss=0.2022, simple_loss=0.293, pruned_loss=0.05569, over 21616.00 frames. ], tot_loss[loss=0.232, simple_loss=0.307, pruned_loss=0.07848, over 4269821.09 frames. ], batch size: 414, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:54:20,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.whiten.whitening_limit, batch_count=1006038.0, ans=12.0 2023-06-24 02:54:27,411 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=12.0 2023-06-24 02:55:22,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1006218.0, ans=0.0 2023-06-24 02:55:42,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1006278.0, ans=0.1 2023-06-24 02:55:47,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1006278.0, ans=10.0 2023-06-24 02:56:03,347 INFO [train.py:996] (2/4) Epoch 6, batch 15250, loss[loss=0.2449, simple_loss=0.2956, pruned_loss=0.09705, over 21253.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3019, pruned_loss=0.07752, over 4264266.55 frames. ], batch size: 471, lr: 5.10e-03, grad_scale: 16.0 2023-06-24 02:56:13,788 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.617e+02 2.536e+02 2.850e+02 3.419e+02 5.207e+02, threshold=5.701e+02, percent-clipped=0.0 2023-06-24 02:56:50,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1006458.0, ans=0.2 2023-06-24 02:56:52,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1006458.0, ans=0.125 2023-06-24 02:57:06,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1006518.0, ans=0.2 2023-06-24 02:57:27,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1006518.0, ans=0.0 2023-06-24 02:57:58,127 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.42 vs. limit=10.0 2023-06-24 02:57:58,582 INFO [train.py:996] (2/4) Epoch 6, batch 15300, loss[loss=0.1921, simple_loss=0.2446, pruned_loss=0.06979, over 20765.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3027, pruned_loss=0.07967, over 4270345.99 frames. 
], batch size: 609, lr: 5.10e-03, grad_scale: 16.0 2023-06-24 02:58:07,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1006638.0, ans=0.125 2023-06-24 02:58:09,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1006638.0, ans=0.125 2023-06-24 02:59:20,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1006878.0, ans=0.2 2023-06-24 02:59:48,121 INFO [train.py:996] (2/4) Epoch 6, batch 15350, loss[loss=0.27, simple_loss=0.3408, pruned_loss=0.09953, over 21259.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3092, pruned_loss=0.08279, over 4270751.87 frames. ], batch size: 176, lr: 5.10e-03, grad_scale: 16.0 2023-06-24 02:59:50,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1006938.0, ans=0.0 2023-06-24 02:59:52,937 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.904e+02 2.681e+02 3.062e+02 3.788e+02 5.909e+02, threshold=6.124e+02, percent-clipped=1.0 2023-06-24 03:00:35,164 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.20 vs. limit=10.0 2023-06-24 03:01:23,782 INFO [train.py:996] (2/4) Epoch 6, batch 15400, loss[loss=0.2348, simple_loss=0.3096, pruned_loss=0.07999, over 21879.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3107, pruned_loss=0.08098, over 4255805.21 frames. ], batch size: 118, lr: 5.10e-03, grad_scale: 16.0 2023-06-24 03:01:35,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1007238.0, ans=0.125 2023-06-24 03:01:55,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1007298.0, ans=0.125 2023-06-24 03:02:06,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1007358.0, ans=0.0 2023-06-24 03:02:06,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1007358.0, ans=0.1 2023-06-24 03:02:35,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1007418.0, ans=0.1 2023-06-24 03:02:41,196 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.28 vs. limit=10.0 2023-06-24 03:02:54,011 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.79 vs. limit=15.0 2023-06-24 03:03:12,816 INFO [train.py:996] (2/4) Epoch 6, batch 15450, loss[loss=0.2104, simple_loss=0.2871, pruned_loss=0.06687, over 21436.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3078, pruned_loss=0.08001, over 4261692.61 frames. 
], batch size: 131, lr: 5.10e-03, grad_scale: 16.0 2023-06-24 03:03:23,435 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.379e+02 2.689e+02 3.180e+02 6.204e+02, threshold=5.379e+02, percent-clipped=1.0 2023-06-24 03:04:28,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1007718.0, ans=0.125 2023-06-24 03:05:07,253 INFO [train.py:996] (2/4) Epoch 6, batch 15500, loss[loss=0.2556, simple_loss=0.3296, pruned_loss=0.09084, over 21465.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3104, pruned_loss=0.07948, over 4251148.97 frames. ], batch size: 131, lr: 5.09e-03, grad_scale: 16.0 2023-06-24 03:05:37,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1007898.0, ans=0.1 2023-06-24 03:06:58,637 INFO [train.py:996] (2/4) Epoch 6, batch 15550, loss[loss=0.213, simple_loss=0.2798, pruned_loss=0.07312, over 21117.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3089, pruned_loss=0.07764, over 4253736.36 frames. ], batch size: 143, lr: 5.09e-03, grad_scale: 16.0 2023-06-24 03:07:03,921 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 2.505e+02 2.792e+02 3.296e+02 4.983e+02, threshold=5.584e+02, percent-clipped=0.0 2023-06-24 03:07:25,369 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.35 vs. limit=15.0 2023-06-24 03:08:05,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1008318.0, ans=0.0 2023-06-24 03:08:26,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1008378.0, ans=0.125 2023-06-24 03:08:26,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1008378.0, ans=0.0 2023-06-24 03:08:38,841 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.95 vs. limit=6.0 2023-06-24 03:08:40,775 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.05 vs. limit=6.0 2023-06-24 03:08:46,200 INFO [train.py:996] (2/4) Epoch 6, batch 15600, loss[loss=0.2675, simple_loss=0.3369, pruned_loss=0.09912, over 21449.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3034, pruned_loss=0.07598, over 4239702.35 frames. ], batch size: 508, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:09:08,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1008498.0, ans=0.2 2023-06-24 03:09:29,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1008558.0, ans=10.0 2023-06-24 03:09:50,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1008618.0, ans=0.125 2023-06-24 03:10:33,934 INFO [train.py:996] (2/4) Epoch 6, batch 15650, loss[loss=0.2018, simple_loss=0.2708, pruned_loss=0.06635, over 21662.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3018, pruned_loss=0.07537, over 4234202.86 frames. 
], batch size: 282, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:10:34,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1008738.0, ans=0.2 2023-06-24 03:10:39,284 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.465e+02 2.724e+02 3.048e+02 4.286e+02, threshold=5.447e+02, percent-clipped=0.0 2023-06-24 03:12:08,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1008978.0, ans=0.2 2023-06-24 03:12:10,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1008978.0, ans=0.0 2023-06-24 03:12:21,553 INFO [train.py:996] (2/4) Epoch 6, batch 15700, loss[loss=0.1945, simple_loss=0.2594, pruned_loss=0.06482, over 21856.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2978, pruned_loss=0.0749, over 4239847.01 frames. ], batch size: 107, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:12:24,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1009038.0, ans=0.125 2023-06-24 03:12:58,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1009098.0, ans=0.0 2023-06-24 03:13:41,349 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.05 vs. limit=15.0 2023-06-24 03:14:08,904 INFO [train.py:996] (2/4) Epoch 6, batch 15750, loss[loss=0.2209, simple_loss=0.2924, pruned_loss=0.07468, over 21796.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2935, pruned_loss=0.0748, over 4239819.68 frames. ], batch size: 317, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:14:11,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1009338.0, ans=0.125 2023-06-24 03:14:14,106 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.808e+02 2.454e+02 2.677e+02 3.133e+02 4.467e+02, threshold=5.354e+02, percent-clipped=0.0 2023-06-24 03:15:57,574 INFO [train.py:996] (2/4) Epoch 6, batch 15800, loss[loss=0.1985, simple_loss=0.2652, pruned_loss=0.06586, over 21734.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.289, pruned_loss=0.07415, over 4246540.10 frames. ], batch size: 124, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:16:10,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1009638.0, ans=0.1 2023-06-24 03:16:12,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1009638.0, ans=0.05 2023-06-24 03:16:14,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1009698.0, ans=0.1 2023-06-24 03:16:42,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1009758.0, ans=0.125 2023-06-24 03:17:20,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1009818.0, ans=0.2 2023-06-24 03:17:45,449 INFO [train.py:996] (2/4) Epoch 6, batch 15850, loss[loss=0.1936, simple_loss=0.256, pruned_loss=0.06563, over 20724.00 frames. 
], tot_loss[loss=0.2228, simple_loss=0.2923, pruned_loss=0.07665, over 4248666.36 frames. ], batch size: 608, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:17:50,482 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.171e+02 2.697e+02 2.988e+02 3.672e+02 5.659e+02, threshold=5.976e+02, percent-clipped=2.0 2023-06-24 03:18:48,802 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=22.5 2023-06-24 03:18:56,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1010118.0, ans=0.0 2023-06-24 03:19:26,495 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.19 vs. limit=15.0 2023-06-24 03:19:29,904 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.92 vs. limit=15.0 2023-06-24 03:19:30,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1010238.0, ans=0.125 2023-06-24 03:19:32,113 INFO [train.py:996] (2/4) Epoch 6, batch 15900, loss[loss=0.2126, simple_loss=0.2944, pruned_loss=0.06542, over 21370.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2901, pruned_loss=0.07629, over 4258611.29 frames. ], batch size: 194, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:19:48,429 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:20:59,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1010478.0, ans=0.125 2023-06-24 03:21:19,544 INFO [train.py:996] (2/4) Epoch 6, batch 15950, loss[loss=0.1639, simple_loss=0.2603, pruned_loss=0.03379, over 21762.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2915, pruned_loss=0.0734, over 4257747.27 frames. 
], batch size: 298, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:21:21,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1010538.0, ans=0.5 2023-06-24 03:21:24,429 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 2.251e+02 2.569e+02 3.023e+02 4.641e+02, threshold=5.138e+02, percent-clipped=0.0 2023-06-24 03:21:44,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1010598.0, ans=0.0 2023-06-24 03:22:05,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1010658.0, ans=0.0 2023-06-24 03:22:24,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1010718.0, ans=0.0 2023-06-24 03:22:28,010 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:22:31,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1010718.0, ans=0.025 2023-06-24 03:22:46,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1010778.0, ans=0.125 2023-06-24 03:22:48,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1010778.0, ans=0.125 2023-06-24 03:23:07,194 INFO [train.py:996] (2/4) Epoch 6, batch 16000, loss[loss=0.2648, simple_loss=0.3382, pruned_loss=0.09567, over 21769.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2943, pruned_loss=0.07224, over 4267553.09 frames. ], batch size: 441, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:23:23,004 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.10 vs. limit=15.0 2023-06-24 03:23:29,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1010898.0, ans=0.0 2023-06-24 03:24:36,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1011078.0, ans=0.0 2023-06-24 03:24:36,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1011078.0, ans=0.2 2023-06-24 03:24:55,886 INFO [train.py:996] (2/4) Epoch 6, batch 16050, loss[loss=0.1586, simple_loss=0.2366, pruned_loss=0.04033, over 21813.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2951, pruned_loss=0.0698, over 4273265.87 frames. 
], batch size: 102, lr: 5.09e-03, grad_scale: 16.0 2023-06-24 03:25:02,637 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 2.499e+02 2.877e+02 3.627e+02 5.675e+02, threshold=5.753e+02, percent-clipped=3.0 2023-06-24 03:25:03,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1011138.0, ans=0.1 2023-06-24 03:25:11,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1011198.0, ans=0.1 2023-06-24 03:25:17,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1011198.0, ans=0.125 2023-06-24 03:25:46,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1011258.0, ans=0.2 2023-06-24 03:25:56,601 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.94 vs. limit=15.0 2023-06-24 03:25:59,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1011318.0, ans=0.0 2023-06-24 03:26:04,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1011318.0, ans=0.125 2023-06-24 03:26:17,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1011318.0, ans=0.1 2023-06-24 03:26:24,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1011378.0, ans=0.2 2023-06-24 03:26:42,144 INFO [train.py:996] (2/4) Epoch 6, batch 16100, loss[loss=0.2632, simple_loss=0.3283, pruned_loss=0.09908, over 21867.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2996, pruned_loss=0.07148, over 4270352.90 frames. ], batch size: 107, lr: 5.09e-03, grad_scale: 16.0 2023-06-24 03:26:42,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1011438.0, ans=0.2 2023-06-24 03:26:56,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1011438.0, ans=0.0 2023-06-24 03:28:23,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1011678.0, ans=0.2 2023-06-24 03:28:28,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1011678.0, ans=0.0 2023-06-24 03:28:31,517 INFO [train.py:996] (2/4) Epoch 6, batch 16150, loss[loss=0.2012, simple_loss=0.2711, pruned_loss=0.06571, over 21528.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2999, pruned_loss=0.07285, over 4278929.83 frames. 
], batch size: 212, lr: 5.08e-03, grad_scale: 16.0 2023-06-24 03:28:34,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1011738.0, ans=0.0 2023-06-24 03:28:38,460 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.535e+02 2.977e+02 3.474e+02 6.271e+02, threshold=5.955e+02, percent-clipped=2.0 2023-06-24 03:29:30,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1011858.0, ans=0.1 2023-06-24 03:30:02,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1011978.0, ans=0.125 2023-06-24 03:30:21,149 INFO [train.py:996] (2/4) Epoch 6, batch 16200, loss[loss=0.2056, simple_loss=0.2691, pruned_loss=0.07107, over 21193.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3027, pruned_loss=0.07363, over 4276308.78 frames. ], batch size: 608, lr: 5.08e-03, grad_scale: 16.0 2023-06-24 03:31:30,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1012218.0, ans=0.1 2023-06-24 03:31:31,163 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.42 vs. limit=15.0 2023-06-24 03:32:09,589 INFO [train.py:996] (2/4) Epoch 6, batch 16250, loss[loss=0.2064, simple_loss=0.2857, pruned_loss=0.06352, over 21504.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3031, pruned_loss=0.07417, over 4278659.64 frames. ], batch size: 389, lr: 5.08e-03, grad_scale: 16.0 2023-06-24 03:32:16,276 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 2.579e+02 2.975e+02 3.411e+02 5.928e+02, threshold=5.950e+02, percent-clipped=0.0 2023-06-24 03:33:44,614 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:33:57,768 INFO [train.py:996] (2/4) Epoch 6, batch 16300, loss[loss=0.1841, simple_loss=0.2759, pruned_loss=0.04613, over 21643.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2966, pruned_loss=0.07072, over 4274013.03 frames. ], batch size: 263, lr: 5.08e-03, grad_scale: 16.0 2023-06-24 03:34:00,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1012638.0, ans=0.1 2023-06-24 03:34:15,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.whiten.whitening_limit, batch_count=1012698.0, ans=15.0 2023-06-24 03:34:37,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1012698.0, ans=0.0 2023-06-24 03:34:51,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1012758.0, ans=0.2 2023-06-24 03:35:23,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1012818.0, ans=0.04949747468305833 2023-06-24 03:35:43,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1012878.0, ans=0.0 2023-06-24 03:35:48,000 INFO [train.py:996] (2/4) Epoch 6, batch 16350, loss[loss=0.2036, simple_loss=0.2948, pruned_loss=0.05622, over 20758.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2983, pruned_loss=0.07247, over 4274573.93 frames. 
], batch size: 607, lr: 5.08e-03, grad_scale: 16.0 2023-06-24 03:35:48,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1012938.0, ans=0.1 2023-06-24 03:36:00,046 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.290e+02 2.661e+02 3.043e+02 4.876e+02, threshold=5.321e+02, percent-clipped=0.0 2023-06-24 03:36:15,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1012998.0, ans=0.125 2023-06-24 03:36:23,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1012998.0, ans=0.125 2023-06-24 03:36:47,676 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.96 vs. limit=22.5 2023-06-24 03:37:26,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1013178.0, ans=0.1 2023-06-24 03:37:32,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1013178.0, ans=0.2 2023-06-24 03:37:33,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1013178.0, ans=0.125 2023-06-24 03:37:36,533 INFO [train.py:996] (2/4) Epoch 6, batch 16400, loss[loss=0.2164, simple_loss=0.2929, pruned_loss=0.06999, over 21441.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3027, pruned_loss=0.07494, over 4281505.93 frames. ], batch size: 548, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:37:45,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1013238.0, ans=0.95 2023-06-24 03:38:00,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1013238.0, ans=0.0 2023-06-24 03:38:00,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1013238.0, ans=0.125 2023-06-24 03:38:19,985 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.26 vs. limit=15.0 2023-06-24 03:39:30,137 INFO [train.py:996] (2/4) Epoch 6, batch 16450, loss[loss=0.2188, simple_loss=0.2888, pruned_loss=0.07437, over 21477.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3029, pruned_loss=0.07623, over 4285922.63 frames. ], batch size: 211, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:39:42,778 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 2.477e+02 2.722e+02 3.151e+02 4.827e+02, threshold=5.443e+02, percent-clipped=0.0 2023-06-24 03:40:59,153 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.88 vs. limit=22.5 2023-06-24 03:41:26,349 INFO [train.py:996] (2/4) Epoch 6, batch 16500, loss[loss=0.2414, simple_loss=0.3614, pruned_loss=0.06069, over 19794.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3001, pruned_loss=0.0753, over 4276172.45 frames. 
], batch size: 703, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:41:52,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1013898.0, ans=0.0 2023-06-24 03:43:15,532 INFO [train.py:996] (2/4) Epoch 6, batch 16550, loss[loss=0.2443, simple_loss=0.3273, pruned_loss=0.08067, over 21598.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2999, pruned_loss=0.0732, over 4278152.16 frames. ], batch size: 389, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:43:22,455 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.591e+02 3.154e+02 3.856e+02 7.253e+02, threshold=6.309e+02, percent-clipped=4.0 2023-06-24 03:44:06,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1014258.0, ans=0.125 2023-06-24 03:45:06,851 INFO [train.py:996] (2/4) Epoch 6, batch 16600, loss[loss=0.1965, simple_loss=0.3215, pruned_loss=0.03576, over 20827.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3086, pruned_loss=0.07621, over 4267551.46 frames. ], batch size: 608, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:45:52,740 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:46:33,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1014618.0, ans=0.1 2023-06-24 03:46:33,765 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=22.5 2023-06-24 03:46:57,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1014678.0, ans=0.2 2023-06-24 03:47:02,042 INFO [train.py:996] (2/4) Epoch 6, batch 16650, loss[loss=0.2622, simple_loss=0.3358, pruned_loss=0.09433, over 21803.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3178, pruned_loss=0.0793, over 4266714.57 frames. ], batch size: 441, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:47:14,436 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.632e+02 2.959e+02 3.254e+02 5.416e+02, threshold=5.917e+02, percent-clipped=0.0 2023-06-24 03:47:26,820 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=12.0 2023-06-24 03:48:12,595 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=22.5 2023-06-24 03:48:21,766 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.30 vs. limit=15.0 2023-06-24 03:48:30,997 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.02 vs. limit=15.0 2023-06-24 03:48:32,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1014918.0, ans=0.1 2023-06-24 03:48:34,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1014918.0, ans=0.125 2023-06-24 03:48:38,263 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. 
limit=6.0 2023-06-24 03:48:59,633 INFO [train.py:996] (2/4) Epoch 6, batch 16700, loss[loss=0.2049, simple_loss=0.2678, pruned_loss=0.07102, over 21126.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3194, pruned_loss=0.08046, over 4257630.54 frames. ], batch size: 143, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:49:30,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1015098.0, ans=0.0 2023-06-24 03:49:36,730 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.58 vs. limit=15.0 2023-06-24 03:49:50,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1015158.0, ans=0.125 2023-06-24 03:49:51,415 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.40 vs. limit=22.5 2023-06-24 03:49:59,261 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.25 vs. limit=22.5 2023-06-24 03:50:25,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1015218.0, ans=0.2 2023-06-24 03:50:58,293 INFO [train.py:996] (2/4) Epoch 6, batch 16750, loss[loss=0.2712, simple_loss=0.3651, pruned_loss=0.08867, over 21288.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3225, pruned_loss=0.08306, over 4267740.35 frames. ], batch size: 549, lr: 5.08e-03, grad_scale: 16.0 2023-06-24 03:51:13,353 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 2.841e+02 3.113e+02 3.878e+02 5.035e+02, threshold=6.225e+02, percent-clipped=0.0 2023-06-24 03:51:39,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1015398.0, ans=0.1 2023-06-24 03:51:46,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1015458.0, ans=0.125 2023-06-24 03:51:54,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1015458.0, ans=10.0 2023-06-24 03:52:01,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1015458.0, ans=0.1 2023-06-24 03:52:55,209 INFO [train.py:996] (2/4) Epoch 6, batch 16800, loss[loss=0.2188, simple_loss=0.2929, pruned_loss=0.07241, over 21813.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3261, pruned_loss=0.08369, over 4268209.71 frames. ], batch size: 282, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:53:34,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=1015698.0, ans=0.02 2023-06-24 03:54:20,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1015878.0, ans=0.125 2023-06-24 03:54:44,522 INFO [train.py:996] (2/4) Epoch 6, batch 16850, loss[loss=0.2219, simple_loss=0.2862, pruned_loss=0.0788, over 21617.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3222, pruned_loss=0.08345, over 4268096.78 frames. 
], batch size: 548, lr: 5.07e-03, grad_scale: 32.0 2023-06-24 03:54:53,527 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 2.780e+02 3.302e+02 4.313e+02 7.428e+02, threshold=6.605e+02, percent-clipped=4.0 2023-06-24 03:54:59,848 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=15.0 2023-06-24 03:55:39,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1016058.0, ans=0.125 2023-06-24 03:55:56,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1016118.0, ans=0.125 2023-06-24 03:56:03,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1016118.0, ans=0.1 2023-06-24 03:56:32,173 INFO [train.py:996] (2/4) Epoch 6, batch 16900, loss[loss=0.19, simple_loss=0.2684, pruned_loss=0.05576, over 21292.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3161, pruned_loss=0.08147, over 4273214.75 frames. ], batch size: 159, lr: 5.07e-03, grad_scale: 32.0 2023-06-24 03:56:48,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1016298.0, ans=0.2 2023-06-24 03:56:49,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1016298.0, ans=0.125 2023-06-24 03:57:33,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1016418.0, ans=0.125 2023-06-24 03:58:19,454 INFO [train.py:996] (2/4) Epoch 6, batch 16950, loss[loss=0.2279, simple_loss=0.2984, pruned_loss=0.07873, over 21838.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.309, pruned_loss=0.0802, over 4265962.16 frames. ], batch size: 124, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 03:58:29,701 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.437e+02 2.853e+02 3.182e+02 4.700e+02, threshold=5.707e+02, percent-clipped=0.0 2023-06-24 03:58:39,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1016598.0, ans=0.125 2023-06-24 03:59:26,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1016718.0, ans=0.2 2023-06-24 04:00:03,693 INFO [train.py:996] (2/4) Epoch 6, batch 17000, loss[loss=0.2257, simple_loss=0.3018, pruned_loss=0.0748, over 21809.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3055, pruned_loss=0.08005, over 4273361.14 frames. ], batch size: 112, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:00:56,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1016958.0, ans=0.125 2023-06-24 04:01:03,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1016958.0, ans=0.125 2023-06-24 04:01:07,398 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.69 vs. 
limit=15.0 2023-06-24 04:01:35,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1017078.0, ans=0.125 2023-06-24 04:01:54,115 INFO [train.py:996] (2/4) Epoch 6, batch 17050, loss[loss=0.2542, simple_loss=0.3364, pruned_loss=0.08598, over 21422.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3112, pruned_loss=0.0817, over 4269104.03 frames. ], batch size: 548, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:02:04,590 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 2.608e+02 3.012e+02 3.512e+02 5.895e+02, threshold=6.025e+02, percent-clipped=1.0 2023-06-24 04:02:12,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1017138.0, ans=0.0 2023-06-24 04:02:14,430 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 04:02:35,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1017258.0, ans=0.125 2023-06-24 04:03:16,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1017318.0, ans=0.2 2023-06-24 04:03:36,151 INFO [train.py:996] (2/4) Epoch 6, batch 17100, loss[loss=0.2218, simple_loss=0.2968, pruned_loss=0.07341, over 21847.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3105, pruned_loss=0.08226, over 4280771.93 frames. ], batch size: 98, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:04:17,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1017498.0, ans=0.125 2023-06-24 04:05:17,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1017678.0, ans=0.125 2023-06-24 04:05:23,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1017738.0, ans=0.125 2023-06-24 04:05:23,951 INFO [train.py:996] (2/4) Epoch 6, batch 17150, loss[loss=0.1941, simple_loss=0.2647, pruned_loss=0.06176, over 21425.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3061, pruned_loss=0.08154, over 4289619.37 frames. 
], batch size: 131, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:05:44,958 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.638e+02 2.899e+02 3.354e+02 4.965e+02, threshold=5.799e+02, percent-clipped=0.0 2023-06-24 04:05:51,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1017798.0, ans=0.125 2023-06-24 04:05:56,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1017798.0, ans=0.0 2023-06-24 04:06:08,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1017858.0, ans=0.125 2023-06-24 04:06:49,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1017918.0, ans=0.125 2023-06-24 04:06:57,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1017978.0, ans=0.95 2023-06-24 04:07:17,732 INFO [train.py:996] (2/4) Epoch 6, batch 17200, loss[loss=0.2727, simple_loss=0.3422, pruned_loss=0.1016, over 21589.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3067, pruned_loss=0.082, over 4292012.45 frames. ], batch size: 415, lr: 5.07e-03, grad_scale: 32.0 2023-06-24 04:08:16,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1018158.0, ans=0.07 2023-06-24 04:08:50,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1018278.0, ans=0.0 2023-06-24 04:09:12,577 INFO [train.py:996] (2/4) Epoch 6, batch 17250, loss[loss=0.2456, simple_loss=0.3358, pruned_loss=0.07773, over 21432.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3101, pruned_loss=0.08297, over 4285390.98 frames. ], batch size: 131, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:09:25,160 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.067e+02 2.699e+02 3.105e+02 3.621e+02 5.993e+02, threshold=6.210e+02, percent-clipped=1.0 2023-06-24 04:10:20,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1018518.0, ans=0.04949747468305833 2023-06-24 04:10:41,950 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.85 vs. limit=12.0 2023-06-24 04:11:01,905 INFO [train.py:996] (2/4) Epoch 6, batch 17300, loss[loss=0.2726, simple_loss=0.3439, pruned_loss=0.1006, over 21283.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3188, pruned_loss=0.08611, over 4281964.67 frames. ], batch size: 143, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:11:52,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1018758.0, ans=0.125 2023-06-24 04:12:20,114 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.84 vs. 
limit=22.5 2023-06-24 04:12:29,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1018818.0, ans=0.0 2023-06-24 04:12:55,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1018878.0, ans=0.0 2023-06-24 04:12:58,379 INFO [train.py:996] (2/4) Epoch 6, batch 17350, loss[loss=0.2397, simple_loss=0.341, pruned_loss=0.06918, over 21245.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3193, pruned_loss=0.08568, over 4274803.42 frames. ], batch size: 548, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:13:12,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1018938.0, ans=0.125 2023-06-24 04:13:16,103 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.921e+02 2.820e+02 3.152e+02 3.644e+02 6.101e+02, threshold=6.303e+02, percent-clipped=0.0 2023-06-24 04:14:08,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1019118.0, ans=0.07 2023-06-24 04:14:17,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1019118.0, ans=0.0 2023-06-24 04:14:22,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1019118.0, ans=0.0 2023-06-24 04:14:54,388 INFO [train.py:996] (2/4) Epoch 6, batch 17400, loss[loss=0.2096, simple_loss=0.2458, pruned_loss=0.08667, over 20049.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3161, pruned_loss=0.08187, over 4276003.27 frames. ], batch size: 704, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:15:04,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1019238.0, ans=0.1 2023-06-24 04:15:18,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1019298.0, ans=0.125 2023-06-24 04:15:21,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1019298.0, ans=0.125 2023-06-24 04:16:44,087 INFO [train.py:996] (2/4) Epoch 6, batch 17450, loss[loss=0.1901, simple_loss=0.2942, pruned_loss=0.04303, over 20736.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3119, pruned_loss=0.07902, over 4271650.02 frames. ], batch size: 608, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:16:44,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1019538.0, ans=0.2 2023-06-24 04:16:48,534 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.78 vs. 
limit=15.0 2023-06-24 04:16:58,158 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.752e+02 2.361e+02 2.755e+02 3.366e+02 5.958e+02, threshold=5.511e+02, percent-clipped=0.0 2023-06-24 04:17:03,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1019598.0, ans=0.125 2023-06-24 04:17:37,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1019658.0, ans=0.125 2023-06-24 04:17:42,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1019658.0, ans=0.2 2023-06-24 04:17:45,267 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.74 vs. limit=10.0 2023-06-24 04:18:16,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1019778.0, ans=0.1 2023-06-24 04:18:30,652 INFO [train.py:996] (2/4) Epoch 6, batch 17500, loss[loss=0.2617, simple_loss=0.3148, pruned_loss=0.1043, over 21608.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3076, pruned_loss=0.07682, over 4275490.57 frames. ], batch size: 471, lr: 5.06e-03, grad_scale: 8.0 2023-06-24 04:18:43,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1019838.0, ans=15.0 2023-06-24 04:19:14,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1019958.0, ans=0.0 2023-06-24 04:19:53,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1020018.0, ans=0.125 2023-06-24 04:19:55,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1020018.0, ans=0.0 2023-06-24 04:20:15,385 INFO [train.py:996] (2/4) Epoch 6, batch 17550, loss[loss=0.2174, simple_loss=0.3117, pruned_loss=0.06159, over 21739.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.306, pruned_loss=0.07479, over 4272817.73 frames. ], batch size: 247, lr: 5.06e-03, grad_scale: 8.0 2023-06-24 04:20:28,782 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.730e+02 2.219e+02 2.535e+02 2.795e+02 4.245e+02, threshold=5.070e+02, percent-clipped=0.0 2023-06-24 04:20:29,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1020138.0, ans=0.125 2023-06-24 04:20:32,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1020198.0, ans=0.1 2023-06-24 04:20:36,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1020198.0, ans=0.0 2023-06-24 04:21:24,290 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0 2023-06-24 04:21:58,676 INFO [train.py:996] (2/4) Epoch 6, batch 17600, loss[loss=0.2302, simple_loss=0.3028, pruned_loss=0.07881, over 21640.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3082, pruned_loss=0.07475, over 4262095.97 frames. 
], batch size: 230, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:22:09,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1020438.0, ans=0.125 2023-06-24 04:22:41,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1020498.0, ans=0.0 2023-06-24 04:23:48,337 INFO [train.py:996] (2/4) Epoch 6, batch 17650, loss[loss=0.2028, simple_loss=0.2784, pruned_loss=0.06366, over 21668.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3067, pruned_loss=0.07573, over 4262292.04 frames. ], batch size: 351, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:24:13,255 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.840e+02 2.481e+02 3.096e+02 4.210e+02 8.151e+02, threshold=6.192e+02, percent-clipped=13.0 2023-06-24 04:25:10,048 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.27 vs. limit=22.5 2023-06-24 04:25:18,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1020978.0, ans=0.2 2023-06-24 04:25:21,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1020978.0, ans=0.125 2023-06-24 04:25:21,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1020978.0, ans=0.05 2023-06-24 04:25:30,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1020978.0, ans=0.2 2023-06-24 04:25:32,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1020978.0, ans=0.1 2023-06-24 04:25:34,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1020978.0, ans=0.0 2023-06-24 04:25:42,587 INFO [train.py:996] (2/4) Epoch 6, batch 17700, loss[loss=0.2196, simple_loss=0.303, pruned_loss=0.06811, over 21649.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3021, pruned_loss=0.0735, over 4256209.45 frames. ], batch size: 263, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:26:00,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1021098.0, ans=0.125 2023-06-24 04:26:06,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1021098.0, ans=0.1 2023-06-24 04:26:35,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1021158.0, ans=0.0 2023-06-24 04:27:30,941 INFO [train.py:996] (2/4) Epoch 6, batch 17750, loss[loss=0.2695, simple_loss=0.3437, pruned_loss=0.09767, over 21384.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3089, pruned_loss=0.07677, over 4258635.00 frames. 
], batch size: 549, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:27:36,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1021338.0, ans=0.0 2023-06-24 04:27:44,843 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.800e+02 2.598e+02 3.053e+02 3.567e+02 5.587e+02, threshold=6.107e+02, percent-clipped=0.0 2023-06-24 04:28:04,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1021398.0, ans=0.0 2023-06-24 04:28:47,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1021518.0, ans=22.5 2023-06-24 04:29:11,099 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.26 vs. limit=10.0 2023-06-24 04:29:20,600 INFO [train.py:996] (2/4) Epoch 6, batch 17800, loss[loss=0.26, simple_loss=0.3399, pruned_loss=0.09003, over 21678.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3086, pruned_loss=0.07594, over 4258231.44 frames. ], batch size: 441, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:29:52,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1021698.0, ans=10.0 2023-06-24 04:30:56,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1021878.0, ans=0.1 2023-06-24 04:31:20,146 INFO [train.py:996] (2/4) Epoch 6, batch 17850, loss[loss=0.2442, simple_loss=0.3118, pruned_loss=0.08825, over 21590.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3095, pruned_loss=0.0766, over 4260482.89 frames. ], batch size: 263, lr: 5.06e-03, grad_scale: 8.0 2023-06-24 04:31:26,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1021938.0, ans=0.125 2023-06-24 04:31:35,790 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.586e+02 3.040e+02 3.727e+02 6.886e+02, threshold=6.079e+02, percent-clipped=3.0 2023-06-24 04:31:52,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1021998.0, ans=0.0 2023-06-24 04:31:54,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1022058.0, ans=0.125 2023-06-24 04:32:10,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1022058.0, ans=0.125 2023-06-24 04:32:10,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1022058.0, ans=0.125 2023-06-24 04:32:29,383 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=12.0 2023-06-24 04:32:30,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1022118.0, ans=0.0 2023-06-24 04:32:42,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1022178.0, ans=0.2 2023-06-24 04:32:58,383 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.77 vs. 
limit=15.0 2023-06-24 04:33:10,775 INFO [train.py:996] (2/4) Epoch 6, batch 17900, loss[loss=0.2494, simple_loss=0.3479, pruned_loss=0.07549, over 21866.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3136, pruned_loss=0.07833, over 4261030.14 frames. ], batch size: 371, lr: 5.06e-03, grad_scale: 8.0 2023-06-24 04:33:31,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1022298.0, ans=0.125 2023-06-24 04:33:51,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1022358.0, ans=10.0 2023-06-24 04:34:05,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1022358.0, ans=0.125 2023-06-24 04:34:12,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1022358.0, ans=0.125 2023-06-24 04:34:44,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1022478.0, ans=0.2 2023-06-24 04:35:01,043 INFO [train.py:996] (2/4) Epoch 6, batch 17950, loss[loss=0.2102, simple_loss=0.3038, pruned_loss=0.0583, over 21769.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3129, pruned_loss=0.07555, over 4256198.20 frames. ], batch size: 351, lr: 5.06e-03, grad_scale: 8.0 2023-06-24 04:35:16,332 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.647e+02 2.348e+02 2.616e+02 3.044e+02 5.736e+02, threshold=5.233e+02, percent-clipped=0.0 2023-06-24 04:35:48,854 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.38 vs. limit=15.0 2023-06-24 04:36:47,688 INFO [train.py:996] (2/4) Epoch 6, batch 18000, loss[loss=0.1863, simple_loss=0.2442, pruned_loss=0.06421, over 21290.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3042, pruned_loss=0.07338, over 4257574.41 frames. ], batch size: 551, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:36:47,689 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 04:37:05,826 INFO [train.py:1028] (2/4) Epoch 6, validation: loss=0.2648, simple_loss=0.3617, pruned_loss=0.08394, over 1796401.00 frames. 2023-06-24 04:37:05,827 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-24 04:37:47,992 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.74 vs. limit=10.0 2023-06-24 04:37:57,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1022958.0, ans=0.125 2023-06-24 04:38:06,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1022958.0, ans=0.2 2023-06-24 04:38:24,679 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-24 04:38:25,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1023018.0, ans=0.125 2023-06-24 04:38:55,684 INFO [train.py:996] (2/4) Epoch 6, batch 18050, loss[loss=0.2115, simple_loss=0.2825, pruned_loss=0.0703, over 21762.00 frames. 
], tot_loss[loss=0.2218, simple_loss=0.2986, pruned_loss=0.07249, over 4269619.27 frames. ], batch size: 371, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:39:22,857 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.791e+02 2.462e+02 2.761e+02 3.558e+02 5.314e+02, threshold=5.521e+02, percent-clipped=1.0 2023-06-24 04:40:05,139 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.56 vs. limit=15.0 2023-06-24 04:40:25,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1023318.0, ans=0.05 2023-06-24 04:40:46,675 INFO [train.py:996] (2/4) Epoch 6, batch 18100, loss[loss=0.2491, simple_loss=0.3294, pruned_loss=0.08435, over 21332.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3044, pruned_loss=0.0746, over 4266877.80 frames. ], batch size: 159, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:42:12,940 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 04:42:24,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1023678.0, ans=0.0 2023-06-24 04:42:30,655 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.90 vs. limit=6.0 2023-06-24 04:42:41,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1023738.0, ans=0.125 2023-06-24 04:42:42,402 INFO [train.py:996] (2/4) Epoch 6, batch 18150, loss[loss=0.2166, simple_loss=0.2822, pruned_loss=0.07548, over 21201.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3058, pruned_loss=0.07425, over 4265912.02 frames. ], batch size: 159, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 04:42:59,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1023738.0, ans=0.0 2023-06-24 04:43:02,496 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.917e+02 2.411e+02 2.816e+02 3.524e+02 6.086e+02, threshold=5.632e+02, percent-clipped=3.0 2023-06-24 04:43:10,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1023798.0, ans=0.05 2023-06-24 04:44:18,492 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 04:44:22,765 INFO [train.py:996] (2/4) Epoch 6, batch 18200, loss[loss=0.1924, simple_loss=0.2675, pruned_loss=0.05869, over 21787.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3013, pruned_loss=0.07442, over 4233616.52 frames. ], batch size: 102, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 04:44:23,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1024038.0, ans=0.125 2023-06-24 04:44:38,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1024038.0, ans=0.07 2023-06-24 04:44:38,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1024038.0, ans=0.125 2023-06-24 04:46:07,536 INFO [train.py:996] (2/4) Epoch 6, batch 18250, loss[loss=0.2154, simple_loss=0.2857, pruned_loss=0.07251, over 21876.00 frames. 
], tot_loss[loss=0.2183, simple_loss=0.293, pruned_loss=0.07182, over 4239334.25 frames. ], batch size: 351, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 04:46:09,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1024338.0, ans=0.125 2023-06-24 04:46:23,379 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 2.233e+02 2.540e+02 3.083e+02 5.311e+02, threshold=5.080e+02, percent-clipped=0.0 2023-06-24 04:46:38,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1024398.0, ans=0.125 2023-06-24 04:47:03,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1024458.0, ans=0.0 2023-06-24 04:47:57,596 INFO [train.py:996] (2/4) Epoch 6, batch 18300, loss[loss=0.2394, simple_loss=0.3406, pruned_loss=0.06911, over 21736.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2954, pruned_loss=0.07278, over 4251043.03 frames. ], batch size: 298, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 04:48:08,582 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 04:48:18,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1024698.0, ans=0.2 2023-06-24 04:49:02,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1024758.0, ans=0.1 2023-06-24 04:49:44,683 INFO [train.py:996] (2/4) Epoch 6, batch 18350, loss[loss=0.2053, simple_loss=0.2763, pruned_loss=0.06721, over 21353.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.301, pruned_loss=0.07296, over 4255284.10 frames. ], batch size: 194, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 04:49:47,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1024938.0, ans=0.125 2023-06-24 04:50:00,362 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.116e+02 2.650e+02 3.163e+02 4.128e+02 7.474e+02, threshold=6.326e+02, percent-clipped=9.0 2023-06-24 04:50:01,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1024998.0, ans=0.0 2023-06-24 04:50:47,434 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.13 vs. limit=5.0 2023-06-24 04:51:26,391 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.60 vs. limit=10.0 2023-06-24 04:51:34,317 INFO [train.py:996] (2/4) Epoch 6, batch 18400, loss[loss=0.2197, simple_loss=0.2817, pruned_loss=0.07882, over 21319.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2957, pruned_loss=0.07169, over 4247789.91 frames. ], batch size: 144, lr: 5.05e-03, grad_scale: 32.0 2023-06-24 04:53:17,800 INFO [train.py:996] (2/4) Epoch 6, batch 18450, loss[loss=0.1765, simple_loss=0.2555, pruned_loss=0.04874, over 21249.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.294, pruned_loss=0.06893, over 4253109.80 frames. 
], batch size: 176, lr: 5.05e-03, grad_scale: 32.0 2023-06-24 04:53:33,306 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 2.125e+02 2.326e+02 2.659e+02 4.995e+02, threshold=4.653e+02, percent-clipped=0.0 2023-06-24 04:53:46,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1025598.0, ans=0.0 2023-06-24 04:54:22,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1025658.0, ans=0.2 2023-06-24 04:54:41,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1025718.0, ans=0.2 2023-06-24 04:54:49,023 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=15.0 2023-06-24 04:55:06,497 INFO [train.py:996] (2/4) Epoch 6, batch 18500, loss[loss=0.2039, simple_loss=0.2686, pruned_loss=0.06962, over 21226.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2881, pruned_loss=0.06699, over 4243819.48 frames. ], batch size: 159, lr: 5.05e-03, grad_scale: 32.0 2023-06-24 04:56:48,796 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.35 vs. limit=22.5 2023-06-24 04:56:52,816 INFO [train.py:996] (2/4) Epoch 6, batch 18550, loss[loss=0.2078, simple_loss=0.2706, pruned_loss=0.07253, over 21454.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2854, pruned_loss=0.06583, over 4231794.78 frames. ], batch size: 132, lr: 5.05e-03, grad_scale: 32.0 2023-06-24 04:56:58,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1026138.0, ans=0.1 2023-06-24 04:57:10,439 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.481e+02 2.781e+02 3.235e+02 5.250e+02, threshold=5.562e+02, percent-clipped=2.0 2023-06-24 04:57:51,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1026258.0, ans=0.5 2023-06-24 04:58:04,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1026318.0, ans=0.2 2023-06-24 04:58:41,458 INFO [train.py:996] (2/4) Epoch 6, batch 18600, loss[loss=0.17, simple_loss=0.2478, pruned_loss=0.04614, over 21456.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2849, pruned_loss=0.06721, over 4235145.73 frames. ], batch size: 160, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 04:58:44,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1026438.0, ans=0.1 2023-06-24 05:00:11,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1026678.0, ans=0.2 2023-06-24 05:00:29,910 INFO [train.py:996] (2/4) Epoch 6, batch 18650, loss[loss=0.1801, simple_loss=0.2493, pruned_loss=0.0555, over 21169.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2855, pruned_loss=0.06776, over 4229487.59 frames. 
], batch size: 143, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 05:00:46,766 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.410e+02 2.665e+02 3.233e+02 6.336e+02, threshold=5.330e+02, percent-clipped=1.0 2023-06-24 05:01:21,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1026858.0, ans=0.2 2023-06-24 05:01:28,541 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-06-24 05:01:39,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1026918.0, ans=0.1 2023-06-24 05:02:16,623 INFO [train.py:996] (2/4) Epoch 6, batch 18700, loss[loss=0.2124, simple_loss=0.2832, pruned_loss=0.07081, over 21832.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2821, pruned_loss=0.06866, over 4240503.03 frames. ], batch size: 107, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 05:02:22,772 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.46 vs. limit=15.0 2023-06-24 05:02:23,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1027038.0, ans=0.125 2023-06-24 05:04:03,128 INFO [train.py:996] (2/4) Epoch 6, batch 18750, loss[loss=0.2533, simple_loss=0.3326, pruned_loss=0.08697, over 21759.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.285, pruned_loss=0.07177, over 4257928.34 frames. ], batch size: 332, lr: 5.05e-03, grad_scale: 8.0 2023-06-24 05:04:22,092 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 2.422e+02 2.735e+02 3.202e+02 4.733e+02, threshold=5.471e+02, percent-clipped=0.0 2023-06-24 05:04:53,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1027458.0, ans=0.125 2023-06-24 05:04:53,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1027458.0, ans=0.1 2023-06-24 05:05:34,632 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=15.0 2023-06-24 05:05:46,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1027578.0, ans=0.5 2023-06-24 05:05:50,788 INFO [train.py:996] (2/4) Epoch 6, batch 18800, loss[loss=0.2046, simple_loss=0.2924, pruned_loss=0.05836, over 21681.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2915, pruned_loss=0.07305, over 4258808.03 frames. ], batch size: 263, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 05:07:09,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1027818.0, ans=0.0 2023-06-24 05:07:38,451 INFO [train.py:996] (2/4) Epoch 6, batch 18850, loss[loss=0.1633, simple_loss=0.2411, pruned_loss=0.04272, over 21312.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2867, pruned_loss=0.06797, over 4265805.49 frames. 
], batch size: 159, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:07:57,086 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 2.192e+02 2.570e+02 2.921e+02 4.536e+02, threshold=5.140e+02, percent-clipped=0.0 2023-06-24 05:08:17,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=1027998.0, ans=0.1 2023-06-24 05:09:26,024 INFO [train.py:996] (2/4) Epoch 6, batch 18900, loss[loss=0.2177, simple_loss=0.2794, pruned_loss=0.07797, over 21793.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2834, pruned_loss=0.06801, over 4272961.29 frames. ], batch size: 416, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:09:30,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1028238.0, ans=0.125 2023-06-24 05:09:40,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1028238.0, ans=0.0 2023-06-24 05:10:57,993 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-24 05:11:03,615 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.20 vs. limit=15.0 2023-06-24 05:11:14,542 INFO [train.py:996] (2/4) Epoch 6, batch 18950, loss[loss=0.1863, simple_loss=0.2507, pruned_loss=0.06091, over 21166.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2849, pruned_loss=0.07055, over 4276279.69 frames. ], batch size: 608, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:11:17,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1028538.0, ans=0.125 2023-06-24 05:11:39,574 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 2.683e+02 3.004e+02 3.629e+02 6.368e+02, threshold=6.008e+02, percent-clipped=2.0 2023-06-24 05:12:39,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1028718.0, ans=0.1 2023-06-24 05:12:43,483 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.06 vs. limit=10.0 2023-06-24 05:13:05,436 INFO [train.py:996] (2/4) Epoch 6, batch 19000, loss[loss=0.2408, simple_loss=0.3194, pruned_loss=0.08113, over 21380.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2945, pruned_loss=0.07199, over 4272276.91 frames. ], batch size: 176, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:13:06,793 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.07 vs. limit=10.0 2023-06-24 05:14:24,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1029018.0, ans=0.0 2023-06-24 05:14:34,033 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.33 vs. limit=15.0 2023-06-24 05:14:54,026 INFO [train.py:996] (2/4) Epoch 6, batch 19050, loss[loss=0.2436, simple_loss=0.3011, pruned_loss=0.09307, over 21519.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2996, pruned_loss=0.07605, over 4272922.47 frames. 
], batch size: 548, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:15:19,303 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 2.839e+02 3.291e+02 3.950e+02 6.159e+02, threshold=6.582e+02, percent-clipped=1.0 2023-06-24 05:16:22,012 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-24 05:16:38,777 INFO [train.py:996] (2/4) Epoch 6, batch 19100, loss[loss=0.196, simple_loss=0.2615, pruned_loss=0.06527, over 21672.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2974, pruned_loss=0.07676, over 4274533.73 frames. ], batch size: 282, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:17:17,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1029498.0, ans=0.125 2023-06-24 05:18:36,316 INFO [train.py:996] (2/4) Epoch 6, batch 19150, loss[loss=0.2231, simple_loss=0.3141, pruned_loss=0.06607, over 21442.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.301, pruned_loss=0.0776, over 4278990.43 frames. ], batch size: 211, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:19:12,556 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.104e+02 2.501e+02 2.737e+02 3.196e+02 5.229e+02, threshold=5.475e+02, percent-clipped=0.0 2023-06-24 05:19:24,884 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.18 vs. limit=10.0 2023-06-24 05:19:33,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1029858.0, ans=0.0 2023-06-24 05:19:52,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1029918.0, ans=0.1 2023-06-24 05:20:37,947 INFO [train.py:996] (2/4) Epoch 6, batch 19200, loss[loss=0.2945, simple_loss=0.393, pruned_loss=0.098, over 21638.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3121, pruned_loss=0.07863, over 4278646.01 frames. ], batch size: 441, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:21:07,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1030098.0, ans=0.0 2023-06-24 05:21:26,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1030158.0, ans=0.125 2023-06-24 05:21:49,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1030218.0, ans=0.125 2023-06-24 05:21:52,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1030278.0, ans=0.1 2023-06-24 05:22:19,336 INFO [train.py:996] (2/4) Epoch 6, batch 19250, loss[loss=0.2164, simple_loss=0.3066, pruned_loss=0.06307, over 21725.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3099, pruned_loss=0.07348, over 4265694.45 frames. ], batch size: 441, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:22:31,391 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.76 vs. 
limit=22.5 2023-06-24 05:22:49,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1030398.0, ans=0.5 2023-06-24 05:22:50,491 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 2.125e+02 2.467e+02 2.912e+02 4.275e+02, threshold=4.933e+02, percent-clipped=0.0 2023-06-24 05:23:07,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1030458.0, ans=0.0 2023-06-24 05:24:02,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1030578.0, ans=0.2 2023-06-24 05:24:12,795 INFO [train.py:996] (2/4) Epoch 6, batch 19300, loss[loss=0.2581, simple_loss=0.3147, pruned_loss=0.1007, over 21775.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3064, pruned_loss=0.07344, over 4274962.07 frames. ], batch size: 441, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:24:32,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1030638.0, ans=0.125 2023-06-24 05:24:34,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1030698.0, ans=0.125 2023-06-24 05:24:58,027 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=15.0 2023-06-24 05:25:02,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1030758.0, ans=0.0 2023-06-24 05:25:02,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1030758.0, ans=0.125 2023-06-24 05:25:08,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1030758.0, ans=0.0 2023-06-24 05:26:02,603 INFO [train.py:996] (2/4) Epoch 6, batch 19350, loss[loss=0.2578, simple_loss=0.3371, pruned_loss=0.08926, over 21563.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.301, pruned_loss=0.06922, over 4273767.30 frames. ], batch size: 473, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:26:28,594 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.277e+02 2.629e+02 3.333e+02 6.338e+02, threshold=5.259e+02, percent-clipped=7.0 2023-06-24 05:26:36,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1030998.0, ans=0.125 2023-06-24 05:26:41,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1031058.0, ans=0.125 2023-06-24 05:27:21,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1031178.0, ans=0.0 2023-06-24 05:27:50,231 INFO [train.py:996] (2/4) Epoch 6, batch 19400, loss[loss=0.2048, simple_loss=0.2754, pruned_loss=0.06714, over 21827.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.3003, pruned_loss=0.06907, over 4268471.09 frames. 
], batch size: 282, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:27:58,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1031238.0, ans=0.0 2023-06-24 05:28:08,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1031238.0, ans=0.125 2023-06-24 05:28:17,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1031298.0, ans=0.0 2023-06-24 05:28:19,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1031298.0, ans=0.5 2023-06-24 05:28:19,968 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-24 05:28:28,667 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.50 vs. limit=22.5 2023-06-24 05:28:42,810 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.39 vs. limit=6.0 2023-06-24 05:28:43,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1031358.0, ans=0.0 2023-06-24 05:28:58,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1031418.0, ans=0.2 2023-06-24 05:29:38,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1031478.0, ans=0.125 2023-06-24 05:29:44,568 INFO [train.py:996] (2/4) Epoch 6, batch 19450, loss[loss=0.2209, simple_loss=0.2891, pruned_loss=0.07638, over 21439.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.297, pruned_loss=0.07108, over 4264984.33 frames. ], batch size: 131, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:30:05,472 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.971e+02 2.505e+02 2.907e+02 3.403e+02 7.011e+02, threshold=5.814e+02, percent-clipped=3.0 2023-06-24 05:30:12,109 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.59 vs. limit=22.5 2023-06-24 05:30:20,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1031658.0, ans=0.0 2023-06-24 05:30:27,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1031658.0, ans=0.125 2023-06-24 05:31:06,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1031778.0, ans=0.125 2023-06-24 05:31:17,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1031778.0, ans=0.0 2023-06-24 05:31:29,141 INFO [train.py:996] (2/4) Epoch 6, batch 19500, loss[loss=0.2808, simple_loss=0.3375, pruned_loss=0.112, over 21422.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2932, pruned_loss=0.0721, over 4261774.06 frames. 
], batch size: 507, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:31:44,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1031838.0, ans=0.0 2023-06-24 05:32:32,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1032018.0, ans=0.125 2023-06-24 05:32:35,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1032018.0, ans=0.125 2023-06-24 05:33:12,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1032078.0, ans=0.125 2023-06-24 05:33:17,784 INFO [train.py:996] (2/4) Epoch 6, batch 19550, loss[loss=0.2004, simple_loss=0.3016, pruned_loss=0.04964, over 21753.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2889, pruned_loss=0.07091, over 4238942.01 frames. ], batch size: 332, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:33:21,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1032138.0, ans=0.07 2023-06-24 05:33:37,932 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.061e+02 2.721e+02 3.147e+02 3.714e+02 5.540e+02, threshold=6.293e+02, percent-clipped=0.0 2023-06-24 05:34:37,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1032378.0, ans=0.2 2023-06-24 05:35:04,146 INFO [train.py:996] (2/4) Epoch 6, batch 19600, loss[loss=0.2584, simple_loss=0.3291, pruned_loss=0.09389, over 21903.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2908, pruned_loss=0.0717, over 4250589.07 frames. ], batch size: 107, lr: 5.03e-03, grad_scale: 32.0 2023-06-24 05:35:15,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1032438.0, ans=0.125 2023-06-24 05:35:22,376 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:35:57,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1032558.0, ans=0.0 2023-06-24 05:35:58,041 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.74 vs. limit=15.0 2023-06-24 05:36:20,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1032618.0, ans=0.0 2023-06-24 05:36:45,470 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.73 vs. limit=22.5 2023-06-24 05:36:53,285 INFO [train.py:996] (2/4) Epoch 6, batch 19650, loss[loss=0.2427, simple_loss=0.3195, pruned_loss=0.08297, over 21444.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2962, pruned_loss=0.07532, over 4255723.90 frames. 
], batch size: 131, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:37:16,214 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.599e+02 2.881e+02 3.237e+02 5.731e+02, threshold=5.762e+02, percent-clipped=0.0 2023-06-24 05:38:07,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1032918.0, ans=0.125 2023-06-24 05:38:09,478 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.72 vs. limit=10.0 2023-06-24 05:38:25,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1032978.0, ans=0.0 2023-06-24 05:38:45,150 INFO [train.py:996] (2/4) Epoch 6, batch 19700, loss[loss=0.2237, simple_loss=0.3127, pruned_loss=0.06733, over 21699.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2982, pruned_loss=0.07459, over 4256786.47 frames. ], batch size: 351, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:40:34,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1033338.0, ans=0.2 2023-06-24 05:40:35,423 INFO [train.py:996] (2/4) Epoch 6, batch 19750, loss[loss=0.2508, simple_loss=0.3386, pruned_loss=0.0815, over 21768.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3093, pruned_loss=0.07682, over 4258743.98 frames. ], batch size: 298, lr: 5.03e-03, grad_scale: 8.0 2023-06-24 05:40:54,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1033338.0, ans=0.125 2023-06-24 05:40:59,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1033398.0, ans=0.125 2023-06-24 05:41:09,208 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 2.723e+02 3.338e+02 4.190e+02 5.879e+02, threshold=6.676e+02, percent-clipped=1.0 2023-06-24 05:42:11,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1033578.0, ans=0.1 2023-06-24 05:42:16,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1033578.0, ans=0.125 2023-06-24 05:42:22,742 INFO [train.py:996] (2/4) Epoch 6, batch 19800, loss[loss=0.2364, simple_loss=0.3139, pruned_loss=0.07944, over 21666.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3071, pruned_loss=0.07677, over 4262141.37 frames. ], batch size: 441, lr: 5.03e-03, grad_scale: 8.0 2023-06-24 05:42:43,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1033638.0, ans=0.0 2023-06-24 05:42:57,383 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. 
limit=6.0 2023-06-24 05:43:34,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1033758.0, ans=0.125 2023-06-24 05:43:43,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1033818.0, ans=0.125 2023-06-24 05:43:45,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1033818.0, ans=0.125 2023-06-24 05:44:13,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1033878.0, ans=0.125 2023-06-24 05:44:17,979 INFO [train.py:996] (2/4) Epoch 6, batch 19850, loss[loss=0.1674, simple_loss=0.2326, pruned_loss=0.0511, over 21821.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2995, pruned_loss=0.07184, over 4265103.60 frames. ], batch size: 102, lr: 5.03e-03, grad_scale: 8.0 2023-06-24 05:44:28,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1033938.0, ans=0.125 2023-06-24 05:44:52,062 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.808e+02 2.320e+02 2.642e+02 2.979e+02 5.130e+02, threshold=5.285e+02, percent-clipped=0.0 2023-06-24 05:45:06,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1034058.0, ans=0.2 2023-06-24 05:45:09,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1034058.0, ans=0.1 2023-06-24 05:45:16,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1034058.0, ans=0.125 2023-06-24 05:45:21,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1034058.0, ans=0.0 2023-06-24 05:46:03,650 INFO [train.py:996] (2/4) Epoch 6, batch 19900, loss[loss=0.2134, simple_loss=0.2754, pruned_loss=0.07572, over 21183.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2994, pruned_loss=0.0694, over 4269636.60 frames. ], batch size: 176, lr: 5.03e-03, grad_scale: 8.0 2023-06-24 05:46:53,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1034358.0, ans=0.125 2023-06-24 05:47:05,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1034358.0, ans=0.125 2023-06-24 05:47:53,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1034478.0, ans=0.125 2023-06-24 05:47:57,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1034538.0, ans=0.0 2023-06-24 05:47:58,276 INFO [train.py:996] (2/4) Epoch 6, batch 19950, loss[loss=0.1854, simple_loss=0.2688, pruned_loss=0.05099, over 21740.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.293, pruned_loss=0.06919, over 4261343.48 frames. 
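Aside on the per-batch loss fields above: throughout this log the logged loss agrees with 0.5 * simple_loss + pruned_loss (an inference from the numbers themselves, not a statement about the training code). A minimal check in Python, using the Epoch 6, batch 19850 values from this line:

# Values copied from the Epoch 6, batch 19850 entry on this line.
simple_loss, pruned_loss = 0.2326, 0.0511
loss = 0.5 * simple_loss + pruned_loss   # assumed weighting, inferred from the logged numbers
print(round(loss, 4))                    # 0.1674, matching the logged loss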
], batch size: 316, lr: 5.03e-03, grad_scale: 8.0 2023-06-24 05:48:33,574 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.795e+02 2.312e+02 2.767e+02 3.263e+02 6.271e+02, threshold=5.533e+02, percent-clipped=3.0 2023-06-24 05:49:10,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1034718.0, ans=0.07 2023-06-24 05:49:46,369 INFO [train.py:996] (2/4) Epoch 6, batch 20000, loss[loss=0.2181, simple_loss=0.2942, pruned_loss=0.07099, over 21642.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2936, pruned_loss=0.06958, over 4265169.42 frames. ], batch size: 263, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:50:20,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1034898.0, ans=0.0 2023-06-24 05:50:37,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1034958.0, ans=0.0 2023-06-24 05:51:22,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1035078.0, ans=0.0 2023-06-24 05:51:33,389 INFO [train.py:996] (2/4) Epoch 6, batch 20050, loss[loss=0.2458, simple_loss=0.3127, pruned_loss=0.08943, over 21803.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2956, pruned_loss=0.0721, over 4270188.70 frames. ], batch size: 414, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:51:40,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1035138.0, ans=0.125 2023-06-24 05:51:45,023 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=15.0 2023-06-24 05:51:54,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1035198.0, ans=0.0 2023-06-24 05:52:08,117 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.887e+02 2.657e+02 2.915e+02 3.243e+02 4.793e+02, threshold=5.831e+02, percent-clipped=0.0 2023-06-24 05:52:24,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1035258.0, ans=0.125 2023-06-24 05:52:27,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1035258.0, ans=0.2 2023-06-24 05:53:19,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1035378.0, ans=0.125 2023-06-24 05:53:23,706 INFO [train.py:996] (2/4) Epoch 6, batch 20100, loss[loss=0.1948, simple_loss=0.261, pruned_loss=0.06428, over 21246.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2976, pruned_loss=0.07469, over 4282770.61 frames. ], batch size: 608, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:53:26,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1035438.0, ans=0.0 2023-06-24 05:53:42,715 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.49 vs. limit=15.0 2023-06-24 05:53:44,645 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.51 vs. 
limit=10.0 2023-06-24 05:53:52,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1035498.0, ans=0.1 2023-06-24 05:54:15,579 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:54:24,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1035558.0, ans=0.07 2023-06-24 05:54:40,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1035618.0, ans=0.1 2023-06-24 05:55:20,100 INFO [train.py:996] (2/4) Epoch 6, batch 20150, loss[loss=0.2797, simple_loss=0.4008, pruned_loss=0.07933, over 20817.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3084, pruned_loss=0.0787, over 4277676.16 frames. ], batch size: 607, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:55:41,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1035798.0, ans=0.125 2023-06-24 05:55:43,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1035798.0, ans=0.0 2023-06-24 05:55:46,197 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.227e+02 2.880e+02 3.455e+02 4.017e+02 7.640e+02, threshold=6.911e+02, percent-clipped=4.0 2023-06-24 05:56:08,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1035858.0, ans=0.125 2023-06-24 05:56:28,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1035918.0, ans=0.125 2023-06-24 05:56:53,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1035978.0, ans=0.0 2023-06-24 05:57:12,794 INFO [train.py:996] (2/4) Epoch 6, batch 20200, loss[loss=0.2588, simple_loss=0.3639, pruned_loss=0.07683, over 20797.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.315, pruned_loss=0.08146, over 4276798.01 frames. ], batch size: 607, lr: 5.02e-03, grad_scale: 16.0 2023-06-24 05:58:18,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1036158.0, ans=0.125 2023-06-24 05:58:53,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1036278.0, ans=0.125 2023-06-24 05:59:01,890 INFO [train.py:996] (2/4) Epoch 6, batch 20250, loss[loss=0.2274, simple_loss=0.3154, pruned_loss=0.06965, over 21812.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3157, pruned_loss=0.07959, over 4280080.20 frames. 
], batch size: 351, lr: 5.02e-03, grad_scale: 16.0 2023-06-24 05:59:06,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1036338.0, ans=0.125 2023-06-24 05:59:23,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1036398.0, ans=0.1 2023-06-24 05:59:25,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1036398.0, ans=0.0 2023-06-24 05:59:26,779 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 2.473e+02 2.856e+02 3.579e+02 8.091e+02, threshold=5.711e+02, percent-clipped=1.0 2023-06-24 05:59:36,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1036398.0, ans=0.035 2023-06-24 05:59:38,837 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0 2023-06-24 06:00:43,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1036578.0, ans=0.125 2023-06-24 06:00:49,981 INFO [train.py:996] (2/4) Epoch 6, batch 20300, loss[loss=0.2296, simple_loss=0.3316, pruned_loss=0.06378, over 20828.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3116, pruned_loss=0.07653, over 4269948.04 frames. ], batch size: 608, lr: 5.02e-03, grad_scale: 16.0 2023-06-24 06:00:51,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1036638.0, ans=10.0 2023-06-24 06:01:11,484 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.89 vs. limit=15.0 2023-06-24 06:02:33,238 INFO [train.py:996] (2/4) Epoch 6, batch 20350, loss[loss=0.2439, simple_loss=0.311, pruned_loss=0.08839, over 21851.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3104, pruned_loss=0.07656, over 4254106.61 frames. ], batch size: 332, lr: 5.02e-03, grad_scale: 16.0 2023-06-24 06:02:56,832 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.320e+02 2.555e+02 2.973e+02 6.061e+02, threshold=5.110e+02, percent-clipped=1.0 2023-06-24 06:02:57,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1036998.0, ans=0.0 2023-06-24 06:03:27,866 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.63 vs. limit=10.0 2023-06-24 06:03:38,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1037118.0, ans=0.1 2023-06-24 06:04:04,412 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.94 vs. limit=10.0 2023-06-24 06:04:20,801 INFO [train.py:996] (2/4) Epoch 6, batch 20400, loss[loss=0.2708, simple_loss=0.3445, pruned_loss=0.09857, over 21346.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3134, pruned_loss=0.07937, over 4263370.67 frames. 
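Aside on the optim.py clipping lines above: the reported threshold tracks Clipping_scale times the middle of the five grad-norm quartiles (the median), to within rounding. A sketch of that relationship, assuming exactly this rule and reusing the 06:02:56 values from this line:

clipping_scale = 2.0                                                  # "Clipping_scale=2.0" above
quartiles = [1.914e+02, 2.320e+02, 2.555e+02, 2.973e+02, 6.061e+02]   # grad-norm quartiles above
threshold = clipping_scale * quartiles[2]                             # 2.0 * median grad norm
print(threshold)                                                      # 511.0 ~ the logged 5.110e+02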
], batch size: 548, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:04:25,727 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.33 vs. limit=15.0 2023-06-24 06:04:26,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1037238.0, ans=0.125 2023-06-24 06:05:03,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1037298.0, ans=0.5 2023-06-24 06:05:16,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1037358.0, ans=0.125 2023-06-24 06:05:18,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1037358.0, ans=0.0 2023-06-24 06:05:25,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1037358.0, ans=0.0 2023-06-24 06:05:32,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1037418.0, ans=0.0 2023-06-24 06:05:51,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1037478.0, ans=0.125 2023-06-24 06:06:04,336 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-06-24 06:06:08,185 INFO [train.py:996] (2/4) Epoch 6, batch 20450, loss[loss=0.2043, simple_loss=0.2484, pruned_loss=0.08003, over 19986.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.316, pruned_loss=0.08238, over 4261102.17 frames. ], batch size: 704, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:06:14,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1037538.0, ans=0.0 2023-06-24 06:06:29,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1037598.0, ans=0.1 2023-06-24 06:06:31,828 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 2.916e+02 3.328e+02 3.687e+02 5.878e+02, threshold=6.655e+02, percent-clipped=5.0 2023-06-24 06:06:58,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1037658.0, ans=0.1 2023-06-24 06:07:19,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1037718.0, ans=0.1 2023-06-24 06:07:35,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1037778.0, ans=0.1 2023-06-24 06:07:54,270 INFO [train.py:996] (2/4) Epoch 6, batch 20500, loss[loss=0.2244, simple_loss=0.2953, pruned_loss=0.07673, over 21341.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3108, pruned_loss=0.0819, over 4261118.83 frames. ], batch size: 176, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:08:00,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1037838.0, ans=0.125 2023-06-24 06:08:10,957 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.51 vs. 
limit=15.0 2023-06-24 06:09:00,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1038018.0, ans=0.125 2023-06-24 06:09:00,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1038018.0, ans=0.2 2023-06-24 06:09:00,735 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.55 vs. limit=15.0 2023-06-24 06:09:14,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1038018.0, ans=0.125 2023-06-24 06:09:41,883 INFO [train.py:996] (2/4) Epoch 6, batch 20550, loss[loss=0.2083, simple_loss=0.2929, pruned_loss=0.0619, over 21568.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3018, pruned_loss=0.07965, over 4261432.10 frames. ], batch size: 263, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:10:01,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1038198.0, ans=0.0 2023-06-24 06:10:06,159 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.939e+02 2.636e+02 3.017e+02 3.648e+02 5.396e+02, threshold=6.035e+02, percent-clipped=0.0 2023-06-24 06:10:16,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1038198.0, ans=0.125 2023-06-24 06:10:28,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1038198.0, ans=0.125 2023-06-24 06:10:43,198 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=12.0 2023-06-24 06:10:54,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1038318.0, ans=0.2 2023-06-24 06:10:55,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1038318.0, ans=0.1 2023-06-24 06:10:59,223 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.08 vs. limit=15.0 2023-06-24 06:11:25,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1038378.0, ans=0.0 2023-06-24 06:11:29,439 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.66 vs. limit=15.0 2023-06-24 06:11:29,780 INFO [train.py:996] (2/4) Epoch 6, batch 20600, loss[loss=0.2433, simple_loss=0.318, pruned_loss=0.0843, over 21531.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3039, pruned_loss=0.07756, over 4249336.83 frames. 
], batch size: 548, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:12:07,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1038498.0, ans=0.05 2023-06-24 06:12:29,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1038558.0, ans=10.0 2023-06-24 06:12:56,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1038678.0, ans=0.0 2023-06-24 06:13:10,686 INFO [train.py:996] (2/4) Epoch 6, batch 20650, loss[loss=0.2107, simple_loss=0.2734, pruned_loss=0.07403, over 21289.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3008, pruned_loss=0.07801, over 4249317.16 frames. ], batch size: 143, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:13:40,726 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.859e+02 2.453e+02 2.852e+02 3.486e+02 6.346e+02, threshold=5.704e+02, percent-clipped=1.0 2023-06-24 06:14:01,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1038858.0, ans=0.125 2023-06-24 06:14:05,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1038858.0, ans=0.2 2023-06-24 06:14:36,695 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-24 06:14:38,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1038918.0, ans=0.07 2023-06-24 06:15:00,376 INFO [train.py:996] (2/4) Epoch 6, batch 20700, loss[loss=0.1954, simple_loss=0.278, pruned_loss=0.05641, over 21706.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2941, pruned_loss=0.07514, over 4255446.60 frames. ], batch size: 298, lr: 5.02e-03, grad_scale: 16.0 2023-06-24 06:15:06,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1039038.0, ans=0.125 2023-06-24 06:15:08,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1039038.0, ans=0.125 2023-06-24 06:15:17,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1039098.0, ans=0.125 2023-06-24 06:16:49,953 INFO [train.py:996] (2/4) Epoch 6, batch 20750, loss[loss=0.2925, simple_loss=0.3932, pruned_loss=0.09596, over 21666.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2957, pruned_loss=0.07386, over 4254694.10 frames. ], batch size: 389, lr: 5.02e-03, grad_scale: 16.0 2023-06-24 06:17:37,392 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.434e+02 2.945e+02 4.112e+02 9.661e+02, threshold=5.891e+02, percent-clipped=8.0 2023-06-24 06:18:10,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1039518.0, ans=0.125 2023-06-24 06:18:24,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1039578.0, ans=0.125 2023-06-24 06:18:43,222 INFO [train.py:996] (2/4) Epoch 6, batch 20800, loss[loss=0.2065, simple_loss=0.2667, pruned_loss=0.07311, over 21423.00 frames. 
], tot_loss[loss=0.2238, simple_loss=0.2985, pruned_loss=0.07459, over 4255611.70 frames. ], batch size: 160, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:18:48,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1039638.0, ans=0.125 2023-06-24 06:19:38,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1039758.0, ans=0.2 2023-06-24 06:19:38,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1039758.0, ans=0.0 2023-06-24 06:19:43,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1039758.0, ans=0.2 2023-06-24 06:20:26,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1039878.0, ans=0.125 2023-06-24 06:20:29,183 INFO [train.py:996] (2/4) Epoch 6, batch 20850, loss[loss=0.1542, simple_loss=0.2321, pruned_loss=0.03815, over 21658.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2934, pruned_loss=0.07307, over 4255468.96 frames. ], batch size: 247, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:20:33,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1039938.0, ans=0.125 2023-06-24 06:20:33,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=1039938.0, ans=0.2 2023-06-24 06:20:33,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1039938.0, ans=0.125 2023-06-24 06:20:36,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1039938.0, ans=0.125 2023-06-24 06:21:06,161 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.402e+02 2.795e+02 3.449e+02 6.931e+02, threshold=5.589e+02, percent-clipped=4.0 2023-06-24 06:21:32,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1040058.0, ans=0.2 2023-06-24 06:21:44,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1040118.0, ans=0.07 2023-06-24 06:22:18,929 INFO [train.py:996] (2/4) Epoch 6, batch 20900, loss[loss=0.2121, simple_loss=0.2926, pruned_loss=0.06577, over 21599.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2946, pruned_loss=0.07431, over 4268451.89 frames. ], batch size: 263, lr: 5.01e-03, grad_scale: 32.0 2023-06-24 06:23:04,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1040298.0, ans=0.1 2023-06-24 06:23:19,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1040358.0, ans=0.125 2023-06-24 06:23:30,607 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.90 vs. limit=15.0 2023-06-24 06:23:52,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1040478.0, ans=0.2 2023-06-24 06:24:04,697 INFO [train.py:996] (2/4) Epoch 6, batch 20950, loss[loss=0.2346, simple_loss=0.3014, pruned_loss=0.0839, over 21775.00 frames. 
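Aside on the grad_scale field, which moves between 8, 16 and 32 across the batches above: this is the behaviour of dynamic loss scaling for fp16 training, where the scale is raised periodically and cut back when a step overflows. A generic sketch with torch.cuda.amp (illustrative only; not the training script, and model/criterion/batch keys here are placeholders):

import torch

scaler = torch.cuda.amp.GradScaler()          # its internal scale is what "grad_scale" reports

def training_step(model, optimizer, criterion, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():           # run the forward pass in reduced precision
        loss = criterion(model(batch["inputs"]), batch["targets"])   # placeholder names
    scaler.scale(loss).backward()             # backward on the scaled loss
    scaler.step(optimizer)                    # skipped internally if gradients overflowed
    scaler.update()                           # grow the scale periodically, halve it on overflow
    return loss.detach(), scaler.get_scale()  # current scale, e.g. 8.0 / 16.0 / 32.0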
], tot_loss[loss=0.2161, simple_loss=0.2904, pruned_loss=0.07095, over 4258365.96 frames. ], batch size: 414, lr: 5.01e-03, grad_scale: 32.0 2023-06-24 06:24:08,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1040538.0, ans=0.0 2023-06-24 06:24:37,833 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.62 vs. limit=15.0 2023-06-24 06:24:40,111 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 2.258e+02 2.758e+02 3.294e+02 6.843e+02, threshold=5.516e+02, percent-clipped=1.0 2023-06-24 06:25:34,946 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.07 vs. limit=22.5 2023-06-24 06:25:50,789 INFO [train.py:996] (2/4) Epoch 6, batch 21000, loss[loss=0.245, simple_loss=0.3019, pruned_loss=0.09401, over 21575.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2887, pruned_loss=0.07087, over 4262301.75 frames. ], batch size: 548, lr: 5.01e-03, grad_scale: 32.0 2023-06-24 06:25:50,790 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 06:26:08,837 INFO [train.py:1028] (2/4) Epoch 6, validation: loss=0.2672, simple_loss=0.3654, pruned_loss=0.08451, over 1796401.00 frames. 2023-06-24 06:26:08,838 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-24 06:26:25,262 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.63 vs. limit=15.0 2023-06-24 06:26:28,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1040838.0, ans=0.125 2023-06-24 06:27:31,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1041078.0, ans=0.125 2023-06-24 06:27:36,847 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.59 vs. limit=15.0 2023-06-24 06:27:47,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1041078.0, ans=0.2 2023-06-24 06:27:50,630 INFO [train.py:996] (2/4) Epoch 6, batch 21050, loss[loss=0.2042, simple_loss=0.2715, pruned_loss=0.06846, over 21321.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2872, pruned_loss=0.07148, over 4256997.66 frames. ], batch size: 144, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:27:58,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1041138.0, ans=0.125 2023-06-24 06:28:23,275 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.974e+02 2.469e+02 2.621e+02 3.007e+02 4.225e+02, threshold=5.242e+02, percent-clipped=0.0 2023-06-24 06:28:44,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1041258.0, ans=0.0 2023-06-24 06:28:52,525 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.51 vs. 
limit=15.0 2023-06-24 06:28:53,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1041318.0, ans=0.1 2023-06-24 06:28:56,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1041318.0, ans=0.125 2023-06-24 06:29:32,198 INFO [train.py:996] (2/4) Epoch 6, batch 21100, loss[loss=0.2105, simple_loss=0.2707, pruned_loss=0.07511, over 21828.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2838, pruned_loss=0.07116, over 4260131.26 frames. ], batch size: 318, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:29:33,675 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.65 vs. limit=15.0 2023-06-24 06:29:34,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1041438.0, ans=0.09899494936611666 2023-06-24 06:29:56,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1041438.0, ans=0.125 2023-06-24 06:29:58,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1041498.0, ans=0.125 2023-06-24 06:30:00,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1041498.0, ans=0.2 2023-06-24 06:30:56,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1041678.0, ans=0.2 2023-06-24 06:31:01,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1041678.0, ans=10.0 2023-06-24 06:31:09,253 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.65 vs. limit=15.0 2023-06-24 06:31:10,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1041678.0, ans=0.125 2023-06-24 06:31:20,318 INFO [train.py:996] (2/4) Epoch 6, batch 21150, loss[loss=0.2173, simple_loss=0.3005, pruned_loss=0.06701, over 16120.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2807, pruned_loss=0.07182, over 4248460.83 frames. ], batch size: 62, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:31:40,511 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.25 vs. limit=22.5 2023-06-24 06:31:42,401 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.59 vs. limit=15.0 2023-06-24 06:31:54,018 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.92 vs. limit=10.0 2023-06-24 06:32:03,769 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.773e+02 2.519e+02 2.928e+02 4.378e+02 7.241e+02, threshold=5.856e+02, percent-clipped=12.0 2023-06-24 06:32:20,205 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.40 vs. 
limit=15.0 2023-06-24 06:32:21,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1041858.0, ans=0.035 2023-06-24 06:32:21,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1041858.0, ans=0.0 2023-06-24 06:32:57,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=1041978.0, ans=0.2 2023-06-24 06:33:01,383 INFO [train.py:996] (2/4) Epoch 6, batch 21200, loss[loss=0.1965, simple_loss=0.2636, pruned_loss=0.06466, over 21406.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2781, pruned_loss=0.07093, over 4243339.65 frames. ], batch size: 131, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:33:35,454 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-06-24 06:34:49,519 INFO [train.py:996] (2/4) Epoch 6, batch 21250, loss[loss=0.1922, simple_loss=0.2563, pruned_loss=0.06409, over 21716.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2766, pruned_loss=0.0714, over 4244631.45 frames. ], batch size: 124, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:35:16,515 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-06-24 06:35:16,547 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.17 vs. limit=10.0 2023-06-24 06:35:33,822 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.602e+02 2.917e+02 3.308e+02 4.858e+02, threshold=5.834e+02, percent-clipped=0.0 2023-06-24 06:35:50,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1042458.0, ans=0.125 2023-06-24 06:36:07,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1042518.0, ans=0.125 2023-06-24 06:36:29,337 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.49 vs. limit=15.0 2023-06-24 06:36:36,716 INFO [train.py:996] (2/4) Epoch 6, batch 21300, loss[loss=0.2186, simple_loss=0.2927, pruned_loss=0.07224, over 21494.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2837, pruned_loss=0.07398, over 4249707.43 frames. ], batch size: 212, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:37:35,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1042758.0, ans=0.125 2023-06-24 06:37:44,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1042758.0, ans=0.125 2023-06-24 06:37:56,857 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.51 vs. limit=15.0 2023-06-24 06:38:20,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1042878.0, ans=0.0 2023-06-24 06:38:28,063 INFO [train.py:996] (2/4) Epoch 6, batch 21350, loss[loss=0.2176, simple_loss=0.2914, pruned_loss=0.07192, over 21803.00 frames. 
], tot_loss[loss=0.2188, simple_loss=0.2879, pruned_loss=0.07488, over 4263922.79 frames. ], batch size: 112, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:39:13,503 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 2.459e+02 2.698e+02 3.098e+02 4.551e+02, threshold=5.397e+02, percent-clipped=0.0 2023-06-24 06:39:46,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1043118.0, ans=0.1 2023-06-24 06:40:18,939 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.33 vs. limit=15.0 2023-06-24 06:40:27,151 INFO [train.py:996] (2/4) Epoch 6, batch 21400, loss[loss=0.209, simple_loss=0.2644, pruned_loss=0.07679, over 20218.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2915, pruned_loss=0.07411, over 4263111.33 frames. ], batch size: 703, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:40:58,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1043298.0, ans=0.2 2023-06-24 06:41:36,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1043418.0, ans=0.125 2023-06-24 06:41:40,220 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=22.5 2023-06-24 06:42:15,527 INFO [train.py:996] (2/4) Epoch 6, batch 21450, loss[loss=0.238, simple_loss=0.3071, pruned_loss=0.08444, over 21801.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2957, pruned_loss=0.07598, over 4270742.43 frames. ], batch size: 441, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:42:47,489 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.66 vs. limit=22.5 2023-06-24 06:42:49,433 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.545e+02 3.012e+02 3.537e+02 6.506e+02, threshold=6.024e+02, percent-clipped=2.0 2023-06-24 06:43:05,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1043658.0, ans=0.0 2023-06-24 06:43:29,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1043778.0, ans=0.125 2023-06-24 06:43:45,720 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=22.5 2023-06-24 06:43:46,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1043778.0, ans=0.0 2023-06-24 06:44:02,135 INFO [train.py:996] (2/4) Epoch 6, batch 21500, loss[loss=0.2167, simple_loss=0.2843, pruned_loss=0.07453, over 21986.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2944, pruned_loss=0.07691, over 4269870.91 frames. ], batch size: 103, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:44:15,691 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.94 vs. limit=15.0 2023-06-24 06:44:20,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1043898.0, ans=0.125 2023-06-24 06:45:50,207 INFO [train.py:996] (2/4) Epoch 6, batch 21550, loss[loss=0.191, simple_loss=0.2596, pruned_loss=0.06119, over 21690.00 frames. 
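Aside on the validation pass logged a few lines above (Epoch 6, batch 21000: validation loss=0.2672 over 1796401.00 frames, peak memory 23731MB): each validation entry is a frame-weighted average over the full dev set, and the memory line reports the peak CUDA allocation. A minimal sketch, assuming hypothetical dev_loader and compute_loss helpers:

import torch

@torch.no_grad()
def validate(model, dev_loader, compute_loss):         # dev_loader / compute_loss are placeholders
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    for batch in dev_loader:
        loss, num_frames = compute_loss(model, batch)  # per-frame loss and frame count
        tot_loss += loss.item() * num_frames           # frame-weighted sum
        tot_frames += num_frames
    model.train()
    peak_mb = torch.cuda.max_memory_allocated() // (1024 * 1024)
    return tot_loss / tot_frames, peak_mb              # cf. 0.2672 and 23731MB above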
], tot_loss[loss=0.2186, simple_loss=0.2883, pruned_loss=0.07448, over 4260701.65 frames. ], batch size: 333, lr: 5.01e-03, grad_scale: 8.0 2023-06-24 06:46:26,728 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.541e+02 2.913e+02 3.487e+02 5.320e+02, threshold=5.826e+02, percent-clipped=0.0 2023-06-24 06:47:39,516 INFO [train.py:996] (2/4) Epoch 6, batch 21600, loss[loss=0.2193, simple_loss=0.2866, pruned_loss=0.07597, over 21983.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2833, pruned_loss=0.07252, over 4251342.41 frames. ], batch size: 103, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:47:49,994 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.28 vs. limit=12.0 2023-06-24 06:47:51,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1044438.0, ans=0.125 2023-06-24 06:48:05,947 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-24 06:49:27,634 INFO [train.py:996] (2/4) Epoch 6, batch 21650, loss[loss=0.2048, simple_loss=0.2925, pruned_loss=0.05854, over 21403.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2858, pruned_loss=0.07045, over 4248116.08 frames. ], batch size: 131, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:49:28,626 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.34 vs. limit=15.0 2023-06-24 06:49:48,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1044798.0, ans=0.0 2023-06-24 06:49:52,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1044798.0, ans=0.125 2023-06-24 06:49:56,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1044798.0, ans=0.1 2023-06-24 06:50:03,886 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.981e+02 2.515e+02 2.797e+02 3.244e+02 5.540e+02, threshold=5.595e+02, percent-clipped=0.0 2023-06-24 06:51:14,469 INFO [train.py:996] (2/4) Epoch 6, batch 21700, loss[loss=0.1968, simple_loss=0.2825, pruned_loss=0.05553, over 21713.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2861, pruned_loss=0.06816, over 4252956.62 frames. 
], batch size: 298, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:51:15,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1045038.0, ans=0.125 2023-06-24 06:51:28,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1045038.0, ans=0.125 2023-06-24 06:51:41,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1045098.0, ans=0.05 2023-06-24 06:51:42,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1045098.0, ans=0.125 2023-06-24 06:51:42,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1045098.0, ans=0.125 2023-06-24 06:53:00,659 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.07 vs. limit=10.0 2023-06-24 06:53:01,227 INFO [train.py:996] (2/4) Epoch 6, batch 21750, loss[loss=0.2008, simple_loss=0.2713, pruned_loss=0.06514, over 15502.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2823, pruned_loss=0.06769, over 4242921.51 frames. ], batch size: 60, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:53:36,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1045398.0, ans=0.2 2023-06-24 06:53:37,710 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.726e+02 2.476e+02 2.744e+02 3.259e+02 4.826e+02, threshold=5.488e+02, percent-clipped=0.0 2023-06-24 06:54:00,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1045518.0, ans=0.1 2023-06-24 06:54:07,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1045518.0, ans=0.0 2023-06-24 06:54:32,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1045578.0, ans=0.2 2023-06-24 06:54:49,973 INFO [train.py:996] (2/4) Epoch 6, batch 21800, loss[loss=0.2084, simple_loss=0.2818, pruned_loss=0.06748, over 21273.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2805, pruned_loss=0.06926, over 4245088.87 frames. ], batch size: 176, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:55:30,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1045758.0, ans=0.125 2023-06-24 06:55:38,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1045758.0, ans=0.125 2023-06-24 06:55:42,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1045758.0, ans=0.0 2023-06-24 06:56:08,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1045818.0, ans=0.125 2023-06-24 06:56:39,657 INFO [train.py:996] (2/4) Epoch 6, batch 21850, loss[loss=0.2281, simple_loss=0.3232, pruned_loss=0.06654, over 21798.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2877, pruned_loss=0.07094, over 4258043.69 frames. 
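Aside on the scaling.py ScheduledFloat lines above: each prints the current value (ans) of a hyper-parameter that is scheduled as a function of batch_count. A sketch of such a piecewise-linear schedule (illustrative; the schedule points below are made up, and this is not the scaling.py implementation itself):

def scheduled_float(batch_count, points):
    """points: [(batch_count, value), ...] sorted by batch_count; linear in between."""
    x0, y0 = points[0]
    if batch_count <= x0:
        return y0
    for x1, y1 in points[1:]:
        if batch_count <= x1:
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)
        x0, y0 = x1, y1
    return y0                                # past the last point, hold the final value

# e.g. a skip-rate that decays from 0.2 to 0.0 over the first 4000 batches:
print(scheduled_float(1_045_098.0, [(0.0, 0.2), (4000.0, 0.0)]))   # 0.0 this late in training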
], batch size: 351, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:57:00,406 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.59 vs. limit=15.0 2023-06-24 06:57:14,587 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-24 06:57:16,748 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.124e+02 2.510e+02 2.889e+02 3.463e+02 5.314e+02, threshold=5.778e+02, percent-clipped=0.0 2023-06-24 06:57:21,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1046058.0, ans=0.125 2023-06-24 06:58:05,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1046118.0, ans=0.2 2023-06-24 06:58:25,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1046178.0, ans=0.0 2023-06-24 06:58:26,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1046238.0, ans=0.125 2023-06-24 06:58:27,751 INFO [train.py:996] (2/4) Epoch 6, batch 21900, loss[loss=0.2334, simple_loss=0.2945, pruned_loss=0.0862, over 21574.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2894, pruned_loss=0.07225, over 4261526.89 frames. ], batch size: 391, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:58:36,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1046238.0, ans=0.0 2023-06-24 06:58:42,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1046238.0, ans=0.0 2023-06-24 06:59:43,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1046418.0, ans=0.0 2023-06-24 07:00:22,009 INFO [train.py:996] (2/4) Epoch 6, batch 21950, loss[loss=0.1464, simple_loss=0.2323, pruned_loss=0.03022, over 21555.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2829, pruned_loss=0.07033, over 4256125.94 frames. ], batch size: 230, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 07:00:31,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1046538.0, ans=0.125 2023-06-24 07:00:53,154 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 2.424e+02 2.913e+02 3.468e+02 5.833e+02, threshold=5.826e+02, percent-clipped=1.0 2023-06-24 07:01:02,749 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-24 07:01:56,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1046778.0, ans=10.0 2023-06-24 07:02:10,404 INFO [train.py:996] (2/4) Epoch 6, batch 22000, loss[loss=0.1783, simple_loss=0.2512, pruned_loss=0.05269, over 21680.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2763, pruned_loss=0.06672, over 4261632.70 frames. 
], batch size: 282, lr: 5.00e-03, grad_scale: 32.0 2023-06-24 07:02:44,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1046958.0, ans=0.125 2023-06-24 07:03:02,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1046958.0, ans=0.2 2023-06-24 07:03:07,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1047018.0, ans=0.2 2023-06-24 07:03:26,451 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.70 vs. limit=12.0 2023-06-24 07:03:42,450 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.93 vs. limit=22.5 2023-06-24 07:03:45,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1047078.0, ans=0.0 2023-06-24 07:03:59,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1047138.0, ans=0.125 2023-06-24 07:04:00,834 INFO [train.py:996] (2/4) Epoch 6, batch 22050, loss[loss=0.2824, simple_loss=0.3556, pruned_loss=0.1046, over 21719.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2803, pruned_loss=0.06856, over 4256933.48 frames. ], batch size: 441, lr: 5.00e-03, grad_scale: 32.0 2023-06-24 07:04:39,712 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.688e+02 2.375e+02 2.787e+02 3.407e+02 5.897e+02, threshold=5.574e+02, percent-clipped=1.0 2023-06-24 07:04:41,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1047258.0, ans=22.5 2023-06-24 07:04:52,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1047258.0, ans=0.125 2023-06-24 07:04:54,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1047258.0, ans=22.5 2023-06-24 07:05:11,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1047318.0, ans=0.05 2023-06-24 07:05:48,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1047438.0, ans=0.125 2023-06-24 07:05:48,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1047438.0, ans=0.125 2023-06-24 07:05:49,361 INFO [train.py:996] (2/4) Epoch 6, batch 22100, loss[loss=0.2479, simple_loss=0.3233, pruned_loss=0.08623, over 21866.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2932, pruned_loss=0.07385, over 4258866.69 frames. ], batch size: 371, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 07:06:02,712 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.17 vs. limit=15.0 2023-06-24 07:07:23,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1047678.0, ans=0.0 2023-06-24 07:07:32,045 INFO [train.py:996] (2/4) Epoch 6, batch 22150, loss[loss=0.2288, simple_loss=0.3088, pruned_loss=0.07443, over 21462.00 frames. 
], tot_loss[loss=0.2229, simple_loss=0.2954, pruned_loss=0.07517, over 4267312.65 frames. ], batch size: 194, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 07:08:10,539 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 2.700e+02 3.228e+02 3.782e+02 5.741e+02, threshold=6.456e+02, percent-clipped=1.0 2023-06-24 07:08:21,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1047858.0, ans=0.0 2023-06-24 07:08:23,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1047858.0, ans=0.125 2023-06-24 07:09:21,406 INFO [train.py:996] (2/4) Epoch 6, batch 22200, loss[loss=0.2349, simple_loss=0.3105, pruned_loss=0.07967, over 17579.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2971, pruned_loss=0.07625, over 4268055.40 frames. ], batch size: 61, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 07:10:23,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1048158.0, ans=0.125 2023-06-24 07:11:09,188 INFO [train.py:996] (2/4) Epoch 6, batch 22250, loss[loss=0.3067, simple_loss=0.3635, pruned_loss=0.125, over 21468.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3048, pruned_loss=0.07792, over 4278898.46 frames. ], batch size: 471, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 07:11:46,677 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.521e+02 2.836e+02 3.368e+02 6.817e+02, threshold=5.671e+02, percent-clipped=1.0 2023-06-24 07:12:19,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1048518.0, ans=0.125 2023-06-24 07:12:55,400 INFO [train.py:996] (2/4) Epoch 6, batch 22300, loss[loss=0.2412, simple_loss=0.3047, pruned_loss=0.08883, over 21231.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3058, pruned_loss=0.07956, over 4286681.25 frames. ], batch size: 143, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:13:24,500 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-24 07:13:30,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1048698.0, ans=0.025 2023-06-24 07:14:33,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1048878.0, ans=0.125 2023-06-24 07:14:38,224 INFO [train.py:996] (2/4) Epoch 6, batch 22350, loss[loss=0.2631, simple_loss=0.3167, pruned_loss=0.1048, over 21837.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3038, pruned_loss=0.08034, over 4298061.77 frames. 
], batch size: 441, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:15:15,681 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 2.647e+02 2.993e+02 3.483e+02 5.422e+02, threshold=5.987e+02, percent-clipped=0.0 2023-06-24 07:15:35,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1049058.0, ans=0.125 2023-06-24 07:15:37,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1049058.0, ans=0.125 2023-06-24 07:15:59,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1049118.0, ans=0.125 2023-06-24 07:16:20,244 INFO [train.py:996] (2/4) Epoch 6, batch 22400, loss[loss=0.202, simple_loss=0.2749, pruned_loss=0.06454, over 21894.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3007, pruned_loss=0.07751, over 4297398.56 frames. ], batch size: 107, lr: 4.99e-03, grad_scale: 32.0 2023-06-24 07:16:35,893 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.20 vs. limit=12.0 2023-06-24 07:17:15,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1049358.0, ans=0.07 2023-06-24 07:18:07,119 INFO [train.py:996] (2/4) Epoch 6, batch 22450, loss[loss=0.195, simple_loss=0.2556, pruned_loss=0.06724, over 21452.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2946, pruned_loss=0.07582, over 4299511.43 frames. ], batch size: 212, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:18:35,835 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=15.0 2023-06-24 07:18:45,247 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=22.5 2023-06-24 07:18:51,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1049658.0, ans=0.0 2023-06-24 07:18:52,755 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.816e+02 2.523e+02 2.860e+02 3.590e+02 5.659e+02, threshold=5.720e+02, percent-clipped=0.0 2023-06-24 07:18:57,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1049658.0, ans=0.125 2023-06-24 07:19:15,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1049718.0, ans=0.07 2023-06-24 07:19:21,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1049718.0, ans=0.04949747468305833 2023-06-24 07:19:34,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1049778.0, ans=0.1 2023-06-24 07:19:50,748 INFO [train.py:996] (2/4) Epoch 6, batch 22500, loss[loss=0.2312, simple_loss=0.2789, pruned_loss=0.0917, over 20035.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2887, pruned_loss=0.07522, over 4285952.95 frames. ], batch size: 702, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:20:12,952 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.38 vs. 
limit=6.0 2023-06-24 07:20:15,809 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:21:35,549 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:21:40,282 INFO [train.py:996] (2/4) Epoch 6, batch 22550, loss[loss=0.2291, simple_loss=0.3124, pruned_loss=0.07288, over 21035.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2939, pruned_loss=0.07569, over 4285230.47 frames. ], batch size: 607, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:21:40,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1050138.0, ans=0.125 2023-06-24 07:22:24,475 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-24 07:22:32,442 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 2.690e+02 3.328e+02 4.292e+02 7.428e+02, threshold=6.656e+02, percent-clipped=5.0 2023-06-24 07:22:34,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1050258.0, ans=0.2 2023-06-24 07:22:34,889 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:22:38,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1050258.0, ans=0.125 2023-06-24 07:22:52,001 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=15.0 2023-06-24 07:23:30,377 INFO [train.py:996] (2/4) Epoch 6, batch 22600, loss[loss=0.2943, simple_loss=0.3728, pruned_loss=0.1079, over 21509.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2975, pruned_loss=0.07639, over 4289498.69 frames. ], batch size: 471, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:23:39,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1050438.0, ans=0.0 2023-06-24 07:23:48,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1050438.0, ans=0.1 2023-06-24 07:23:50,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=1050438.0, ans=0.1 2023-06-24 07:24:00,969 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-24 07:24:15,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1050498.0, ans=0.125 2023-06-24 07:24:25,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1050558.0, ans=0.1 2023-06-24 07:24:31,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1050558.0, ans=0.0 2023-06-24 07:25:23,770 INFO [train.py:996] (2/4) Epoch 6, batch 22650, loss[loss=0.2366, simple_loss=0.3519, pruned_loss=0.06058, over 19817.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2951, pruned_loss=0.07576, over 4286901.00 frames. 
], batch size: 703, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:26:04,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1050798.0, ans=0.125 2023-06-24 07:26:07,811 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.240e+02 2.713e+02 2.934e+02 3.379e+02 4.768e+02, threshold=5.868e+02, percent-clipped=0.0 2023-06-24 07:26:25,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1050918.0, ans=0.125 2023-06-24 07:26:46,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1050978.0, ans=0.2 2023-06-24 07:26:46,611 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=12.0 2023-06-24 07:27:04,016 INFO [train.py:996] (2/4) Epoch 6, batch 22700, loss[loss=0.2253, simple_loss=0.2769, pruned_loss=0.08683, over 21505.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2888, pruned_loss=0.0756, over 4285212.85 frames. ], batch size: 442, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:27:30,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1051098.0, ans=0.0 2023-06-24 07:28:24,051 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-24 07:28:46,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1051278.0, ans=0.1 2023-06-24 07:28:56,587 INFO [train.py:996] (2/4) Epoch 6, batch 22750, loss[loss=0.2471, simple_loss=0.3168, pruned_loss=0.08871, over 21383.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2901, pruned_loss=0.07737, over 4276050.53 frames. ], batch size: 549, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:29:37,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1051398.0, ans=0.125 2023-06-24 07:29:41,502 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 2.658e+02 2.967e+02 3.229e+02 5.067e+02, threshold=5.933e+02, percent-clipped=0.0 2023-06-24 07:29:54,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1051458.0, ans=0.125 2023-06-24 07:29:54,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1051458.0, ans=0.1 2023-06-24 07:30:30,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1051578.0, ans=0.125 2023-06-24 07:30:46,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1051578.0, ans=0.125 2023-06-24 07:30:49,466 INFO [train.py:996] (2/4) Epoch 6, batch 22800, loss[loss=0.2489, simple_loss=0.3063, pruned_loss=0.09572, over 21603.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2939, pruned_loss=0.07975, over 4286815.25 frames. 
], batch size: 471, lr: 4.99e-03, grad_scale: 32.0 2023-06-24 07:31:00,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1051638.0, ans=0.125 2023-06-24 07:31:40,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1051758.0, ans=0.05 2023-06-24 07:32:31,133 INFO [train.py:996] (2/4) Epoch 6, batch 22850, loss[loss=0.2087, simple_loss=0.2893, pruned_loss=0.06402, over 22034.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2903, pruned_loss=0.07799, over 4283496.33 frames. ], batch size: 103, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:32:35,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1051938.0, ans=0.125 2023-06-24 07:33:14,546 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.514e+02 2.935e+02 3.337e+02 4.796e+02, threshold=5.870e+02, percent-clipped=0.0 2023-06-24 07:34:16,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1052178.0, ans=0.2 2023-06-24 07:34:22,983 INFO [train.py:996] (2/4) Epoch 6, batch 22900, loss[loss=0.2874, simple_loss=0.4026, pruned_loss=0.0861, over 19758.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2941, pruned_loss=0.07738, over 4277012.53 frames. ], batch size: 702, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:34:58,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1052298.0, ans=0.125 2023-06-24 07:35:04,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1052358.0, ans=0.1 2023-06-24 07:35:52,651 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.59 vs. limit=15.0 2023-06-24 07:36:02,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1052478.0, ans=0.0 2023-06-24 07:36:14,167 INFO [train.py:996] (2/4) Epoch 6, batch 22950, loss[loss=0.2333, simple_loss=0.373, pruned_loss=0.04677, over 20764.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3077, pruned_loss=0.07567, over 4272711.82 frames. ], batch size: 608, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:36:56,203 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.392e+02 2.733e+02 3.196e+02 4.909e+02, threshold=5.466e+02, percent-clipped=0.0 2023-06-24 07:37:04,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1052658.0, ans=0.05 2023-06-24 07:37:30,951 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.71 vs. limit=10.0 2023-06-24 07:38:02,384 INFO [train.py:996] (2/4) Epoch 6, batch 23000, loss[loss=0.2129, simple_loss=0.2922, pruned_loss=0.06674, over 21505.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3063, pruned_loss=0.07375, over 4272527.98 frames. ], batch size: 131, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:38:25,425 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.62 vs. 
limit=15.0 2023-06-24 07:39:21,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1053018.0, ans=0.04949747468305833 2023-06-24 07:39:30,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1053018.0, ans=0.025 2023-06-24 07:39:35,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1053078.0, ans=0.0 2023-06-24 07:39:47,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1053078.0, ans=0.2 2023-06-24 07:39:58,084 INFO [train.py:996] (2/4) Epoch 6, batch 23050, loss[loss=0.2441, simple_loss=0.3209, pruned_loss=0.08362, over 21815.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3072, pruned_loss=0.0757, over 4279982.37 frames. ], batch size: 118, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:40:41,051 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 2.622e+02 2.848e+02 3.330e+02 6.770e+02, threshold=5.696e+02, percent-clipped=1.0 2023-06-24 07:41:48,450 INFO [train.py:996] (2/4) Epoch 6, batch 23100, loss[loss=0.2289, simple_loss=0.2933, pruned_loss=0.08221, over 15361.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3031, pruned_loss=0.07652, over 4266135.40 frames. ], batch size: 62, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:41:54,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1053438.0, ans=0.0 2023-06-24 07:42:04,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1053498.0, ans=0.04949747468305833 2023-06-24 07:42:13,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1053498.0, ans=0.04949747468305833 2023-06-24 07:42:39,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1053558.0, ans=0.1 2023-06-24 07:42:58,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1053618.0, ans=0.0 2023-06-24 07:43:36,009 INFO [train.py:996] (2/4) Epoch 6, batch 23150, loss[loss=0.2135, simple_loss=0.2847, pruned_loss=0.07114, over 21829.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2963, pruned_loss=0.07527, over 4274670.03 frames. ], batch size: 332, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:44:16,352 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.069e+02 2.510e+02 2.867e+02 3.300e+02 5.681e+02, threshold=5.734e+02, percent-clipped=0.0 2023-06-24 07:44:50,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1053918.0, ans=0.2 2023-06-24 07:44:59,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1053978.0, ans=0.1 2023-06-24 07:45:09,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1053978.0, ans=0.0 2023-06-24 07:45:15,935 INFO [train.py:996] (2/4) Epoch 6, batch 23200, loss[loss=0.1945, simple_loss=0.2618, pruned_loss=0.06366, over 21726.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2949, pruned_loss=0.07593, over 4284069.88 frames. 
], batch size: 230, lr: 4.98e-03, grad_scale: 32.0 2023-06-24 07:45:32,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1054038.0, ans=0.04949747468305833 2023-06-24 07:46:00,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1054158.0, ans=0.125 2023-06-24 07:46:46,831 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0 2023-06-24 07:46:48,469 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=12.0 2023-06-24 07:47:02,643 INFO [train.py:996] (2/4) Epoch 6, batch 23250, loss[loss=0.2223, simple_loss=0.2799, pruned_loss=0.08236, over 21576.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2944, pruned_loss=0.07746, over 4291917.51 frames. ], batch size: 548, lr: 4.98e-03, grad_scale: 32.0 2023-06-24 07:47:56,066 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.962e+02 2.641e+02 2.998e+02 3.541e+02 5.576e+02, threshold=5.996e+02, percent-clipped=0.0 2023-06-24 07:48:07,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1054458.0, ans=0.125 2023-06-24 07:48:07,977 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-06-24 07:48:35,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1054578.0, ans=0.125 2023-06-24 07:48:58,019 INFO [train.py:996] (2/4) Epoch 6, batch 23300, loss[loss=0.2803, simple_loss=0.3895, pruned_loss=0.08561, over 21839.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3038, pruned_loss=0.07947, over 4291271.38 frames. ], batch size: 371, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:49:11,930 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.52 vs. limit=15.0 2023-06-24 07:49:20,313 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=15.0 2023-06-24 07:50:32,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1054878.0, ans=15.0 2023-06-24 07:50:46,433 INFO [train.py:996] (2/4) Epoch 6, batch 23350, loss[loss=0.2147, simple_loss=0.3117, pruned_loss=0.0589, over 20827.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3082, pruned_loss=0.07858, over 4290394.50 frames. ], batch size: 607, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:50:48,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1054938.0, ans=0.1 2023-06-24 07:51:20,045 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. 
limit=6.0 2023-06-24 07:51:41,620 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.696e+02 2.545e+02 3.075e+02 3.480e+02 4.848e+02, threshold=6.150e+02, percent-clipped=0.0 2023-06-24 07:51:44,819 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.74 vs. limit=6.0 2023-06-24 07:52:34,879 INFO [train.py:996] (2/4) Epoch 6, batch 23400, loss[loss=0.2151, simple_loss=0.2805, pruned_loss=0.07479, over 21348.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3014, pruned_loss=0.07493, over 4294367.05 frames. ], batch size: 176, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:53:19,107 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:54:08,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1055478.0, ans=0.1 2023-06-24 07:54:33,575 INFO [train.py:996] (2/4) Epoch 6, batch 23450, loss[loss=0.2944, simple_loss=0.349, pruned_loss=0.1199, over 21441.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3028, pruned_loss=0.07764, over 4298253.72 frames. ], batch size: 471, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:55:11,698 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=22.5 2023-06-24 07:55:17,216 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.530e+02 2.834e+02 3.227e+02 5.088e+02, threshold=5.668e+02, percent-clipped=0.0 2023-06-24 07:55:47,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1055718.0, ans=0.125 2023-06-24 07:55:57,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1055778.0, ans=0.2 2023-06-24 07:56:13,660 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=15.0 2023-06-24 07:56:20,857 INFO [train.py:996] (2/4) Epoch 6, batch 23500, loss[loss=0.1634, simple_loss=0.2209, pruned_loss=0.05295, over 19919.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3021, pruned_loss=0.07932, over 4289949.78 frames. ], batch size: 703, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:56:24,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1055838.0, ans=0.125 2023-06-24 07:57:01,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1055958.0, ans=0.0 2023-06-24 07:57:07,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1055958.0, ans=0.125 2023-06-24 07:57:08,496 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.89 vs. 
limit=22.5 2023-06-24 07:57:09,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1055958.0, ans=0.0 2023-06-24 07:57:52,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1056078.0, ans=0.0 2023-06-24 07:58:09,295 INFO [train.py:996] (2/4) Epoch 6, batch 23550, loss[loss=0.1916, simple_loss=0.2524, pruned_loss=0.0654, over 21554.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.2962, pruned_loss=0.07878, over 4288996.92 frames. ], batch size: 212, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:58:31,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=1056198.0, ans=0.02 2023-06-24 07:58:46,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1056198.0, ans=0.125 2023-06-24 07:58:52,556 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 2.611e+02 2.905e+02 3.629e+02 5.861e+02, threshold=5.811e+02, percent-clipped=1.0 2023-06-24 07:59:05,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1056318.0, ans=0.0 2023-06-24 07:59:23,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1056318.0, ans=0.0 2023-06-24 07:59:28,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1056378.0, ans=0.0 2023-06-24 07:59:57,762 INFO [train.py:996] (2/4) Epoch 6, batch 23600, loss[loss=0.2379, simple_loss=0.3129, pruned_loss=0.08143, over 21552.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.2984, pruned_loss=0.07919, over 4292174.59 frames. ], batch size: 389, lr: 4.98e-03, grad_scale: 32.0 2023-06-24 08:00:13,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1056438.0, ans=0.125 2023-06-24 08:00:26,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1056498.0, ans=0.05 2023-06-24 08:01:51,993 INFO [train.py:996] (2/4) Epoch 6, batch 23650, loss[loss=0.2323, simple_loss=0.3118, pruned_loss=0.0764, over 21453.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2984, pruned_loss=0.07665, over 4282974.31 frames. ], batch size: 131, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 08:02:38,196 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.630e+02 3.092e+02 3.541e+02 6.593e+02, threshold=6.183e+02, percent-clipped=1.0 2023-06-24 08:03:03,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1056918.0, ans=0.125 2023-06-24 08:03:12,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1056918.0, ans=0.1 2023-06-24 08:03:40,807 INFO [train.py:996] (2/4) Epoch 6, batch 23700, loss[loss=0.2366, simple_loss=0.3077, pruned_loss=0.08272, over 19925.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3001, pruned_loss=0.07587, over 4278048.38 frames. 
], batch size: 704, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:03:45,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1057038.0, ans=0.125 2023-06-24 08:04:27,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1057158.0, ans=0.125 2023-06-24 08:04:38,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1057158.0, ans=0.07 2023-06-24 08:05:22,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1057278.0, ans=0.125 2023-06-24 08:05:23,067 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-24 08:05:31,882 INFO [train.py:996] (2/4) Epoch 6, batch 23750, loss[loss=0.2029, simple_loss=0.2995, pruned_loss=0.05314, over 21765.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3043, pruned_loss=0.07679, over 4278109.22 frames. ], batch size: 298, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:05:36,119 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:05:48,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1057398.0, ans=0.0 2023-06-24 08:06:07,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1057398.0, ans=0.125 2023-06-24 08:06:26,818 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.851e+02 2.305e+02 2.862e+02 3.715e+02 6.571e+02, threshold=5.724e+02, percent-clipped=1.0 2023-06-24 08:07:13,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1057578.0, ans=0.125 2023-06-24 08:07:21,350 INFO [train.py:996] (2/4) Epoch 6, batch 23800, loss[loss=0.2207, simple_loss=0.3042, pruned_loss=0.06858, over 21357.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3011, pruned_loss=0.07472, over 4280144.77 frames. ], batch size: 211, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:07:44,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1057698.0, ans=0.0 2023-06-24 08:08:30,793 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=22.5 2023-06-24 08:08:39,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1057818.0, ans=0.1 2023-06-24 08:08:39,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1057818.0, ans=0.5 2023-06-24 08:08:47,218 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.22 vs. limit=10.0 2023-06-24 08:09:18,107 INFO [train.py:996] (2/4) Epoch 6, batch 23850, loss[loss=0.251, simple_loss=0.3231, pruned_loss=0.0894, over 21342.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3124, pruned_loss=0.07812, over 4286267.07 frames. 
], batch size: 549, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:10:09,050 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.30 vs. limit=15.0 2023-06-24 08:10:10,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1058058.0, ans=0.0 2023-06-24 08:10:14,729 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.790e+02 3.206e+02 3.794e+02 6.982e+02, threshold=6.412e+02, percent-clipped=2.0 2023-06-24 08:11:12,060 INFO [train.py:996] (2/4) Epoch 6, batch 23900, loss[loss=0.2037, simple_loss=0.2849, pruned_loss=0.06126, over 21489.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3187, pruned_loss=0.07974, over 4289634.82 frames. ], batch size: 230, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:11:47,428 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.02 vs. limit=10.0 2023-06-24 08:11:57,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1058358.0, ans=0.0 2023-06-24 08:13:00,271 INFO [train.py:996] (2/4) Epoch 6, batch 23950, loss[loss=0.217, simple_loss=0.2779, pruned_loss=0.07809, over 21609.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3118, pruned_loss=0.07927, over 4283093.28 frames. ], batch size: 247, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:13:39,956 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.69 vs. limit=15.0 2023-06-24 08:13:49,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1058658.0, ans=0.1 2023-06-24 08:13:52,884 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.147e+02 2.675e+02 3.021e+02 3.458e+02 5.557e+02, threshold=6.041e+02, percent-clipped=0.0 2023-06-24 08:14:23,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1058718.0, ans=0.125 2023-06-24 08:14:53,453 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-24 08:14:54,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1058838.0, ans=0.125 2023-06-24 08:14:55,882 INFO [train.py:996] (2/4) Epoch 6, batch 24000, loss[loss=0.2383, simple_loss=0.3122, pruned_loss=0.08225, over 21728.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.311, pruned_loss=0.08116, over 4281116.92 frames. ], batch size: 298, lr: 4.97e-03, grad_scale: 32.0 2023-06-24 08:14:55,882 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 08:15:17,156 INFO [train.py:1028] (2/4) Epoch 6, validation: loss=0.2634, simple_loss=0.3603, pruned_loss=0.08319, over 1796401.00 frames. 2023-06-24 08:15:17,157 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-24 08:15:43,789 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.40 vs. 
limit=15.0 2023-06-24 08:16:15,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1059018.0, ans=0.0 2023-06-24 08:16:16,721 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:17:08,164 INFO [train.py:996] (2/4) Epoch 6, batch 24050, loss[loss=0.2293, simple_loss=0.3154, pruned_loss=0.07162, over 21865.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3127, pruned_loss=0.08195, over 4279658.90 frames. ], batch size: 371, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:17:10,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1059138.0, ans=0.125 2023-06-24 08:17:56,746 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.625e+02 3.028e+02 3.764e+02 6.671e+02, threshold=6.056e+02, percent-clipped=1.0 2023-06-24 08:18:09,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1059258.0, ans=0.125 2023-06-24 08:18:17,246 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.98 vs. limit=10.0 2023-06-24 08:18:59,230 INFO [train.py:996] (2/4) Epoch 6, batch 24100, loss[loss=0.2236, simple_loss=0.2824, pruned_loss=0.08243, over 20153.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3137, pruned_loss=0.08092, over 4277091.39 frames. ], batch size: 703, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:18:59,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1059438.0, ans=0.125 2023-06-24 08:19:12,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1059438.0, ans=0.0 2023-06-24 08:19:17,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1059498.0, ans=0.1 2023-06-24 08:20:12,211 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.81 vs. limit=15.0 2023-06-24 08:20:19,686 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.62 vs. limit=22.5 2023-06-24 08:20:26,748 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=15.0 2023-06-24 08:20:44,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1059678.0, ans=0.1 2023-06-24 08:20:49,110 INFO [train.py:996] (2/4) Epoch 6, batch 24150, loss[loss=0.219, simple_loss=0.285, pruned_loss=0.07656, over 21880.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3129, pruned_loss=0.08225, over 4279652.85 frames. ], batch size: 298, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:21:00,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1059738.0, ans=0.2 2023-06-24 08:21:10,384 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.45 vs. 
limit=15.0 2023-06-24 08:21:37,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1059858.0, ans=0.125 2023-06-24 08:21:43,503 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 2.678e+02 3.013e+02 3.443e+02 5.621e+02, threshold=6.026e+02, percent-clipped=0.0 2023-06-24 08:22:08,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1059918.0, ans=0.1 2023-06-24 08:22:40,790 INFO [train.py:996] (2/4) Epoch 6, batch 24200, loss[loss=0.2308, simple_loss=0.3108, pruned_loss=0.07539, over 21642.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3164, pruned_loss=0.08431, over 4289204.20 frames. ], batch size: 263, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:22:43,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1060038.0, ans=0.125 2023-06-24 08:22:58,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1060098.0, ans=0.0 2023-06-24 08:23:09,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1060098.0, ans=0.0 2023-06-24 08:23:25,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1060158.0, ans=0.125 2023-06-24 08:23:35,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1060158.0, ans=0.125 2023-06-24 08:23:35,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1060158.0, ans=0.125 2023-06-24 08:24:27,616 INFO [train.py:996] (2/4) Epoch 6, batch 24250, loss[loss=0.1949, simple_loss=0.2843, pruned_loss=0.05276, over 21629.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3121, pruned_loss=0.07753, over 4285952.58 frames. ], batch size: 230, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:24:51,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1060338.0, ans=0.125 2023-06-24 08:24:55,342 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=15.0 2023-06-24 08:25:03,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1060398.0, ans=0.125 2023-06-24 08:25:25,865 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 2.253e+02 2.770e+02 3.370e+02 5.813e+02, threshold=5.539e+02, percent-clipped=0.0 2023-06-24 08:26:15,755 INFO [train.py:996] (2/4) Epoch 6, batch 24300, loss[loss=0.1944, simple_loss=0.2831, pruned_loss=0.05283, over 21341.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3044, pruned_loss=0.07195, over 4284599.54 frames. ], batch size: 548, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:27:26,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1060818.0, ans=0.125 2023-06-24 08:27:45,680 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.23 vs. 
limit=12.0 2023-06-24 08:28:09,180 INFO [train.py:996] (2/4) Epoch 6, batch 24350, loss[loss=0.2204, simple_loss=0.2922, pruned_loss=0.07425, over 21678.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3023, pruned_loss=0.07252, over 4288193.39 frames. ], batch size: 263, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:28:18,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1060938.0, ans=0.1 2023-06-24 08:29:01,797 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.594e+02 2.610e+02 2.946e+02 3.475e+02 5.631e+02, threshold=5.892e+02, percent-clipped=1.0 2023-06-24 08:29:06,964 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=15.0 2023-06-24 08:29:58,777 INFO [train.py:996] (2/4) Epoch 6, batch 24400, loss[loss=0.2055, simple_loss=0.2748, pruned_loss=0.0681, over 20795.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3066, pruned_loss=0.07565, over 4284295.49 frames. ], batch size: 608, lr: 4.97e-03, grad_scale: 32.0 2023-06-24 08:31:40,135 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.87 vs. limit=10.0 2023-06-24 08:31:49,164 INFO [train.py:996] (2/4) Epoch 6, batch 24450, loss[loss=0.2343, simple_loss=0.3238, pruned_loss=0.07247, over 21695.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3081, pruned_loss=0.07663, over 4284459.82 frames. ], batch size: 298, lr: 4.96e-03, grad_scale: 32.0 2023-06-24 08:31:53,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1061538.0, ans=0.1 2023-06-24 08:32:14,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1061598.0, ans=0.125 2023-06-24 08:32:41,891 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.780e+02 3.190e+02 3.668e+02 5.575e+02, threshold=6.380e+02, percent-clipped=0.0 2023-06-24 08:32:44,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1061658.0, ans=0.125 2023-06-24 08:33:00,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1061718.0, ans=0.125 2023-06-24 08:33:10,947 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=15.0 2023-06-24 08:33:21,272 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.74 vs. limit=6.0 2023-06-24 08:33:22,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1061778.0, ans=0.125 2023-06-24 08:33:37,488 INFO [train.py:996] (2/4) Epoch 6, batch 24500, loss[loss=0.2459, simple_loss=0.3083, pruned_loss=0.0917, over 21549.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3078, pruned_loss=0.07678, over 4280147.53 frames. 
], batch size: 548, lr: 4.96e-03, grad_scale: 32.0 2023-06-24 08:34:24,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1061958.0, ans=0.04949747468305833 2023-06-24 08:34:29,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=1061958.0, ans=0.02 2023-06-24 08:34:31,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1061958.0, ans=0.0 2023-06-24 08:34:35,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1061958.0, ans=0.125 2023-06-24 08:34:40,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1062018.0, ans=0.04949747468305833 2023-06-24 08:35:02,599 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:35:28,564 INFO [train.py:996] (2/4) Epoch 6, batch 24550, loss[loss=0.2698, simple_loss=0.3445, pruned_loss=0.09758, over 21201.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3099, pruned_loss=0.07829, over 4286355.59 frames. ], batch size: 143, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:36:07,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1062198.0, ans=0.025 2023-06-24 08:36:18,658 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.580e+02 2.942e+02 3.468e+02 6.882e+02, threshold=5.884e+02, percent-clipped=1.0 2023-06-24 08:36:38,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1062318.0, ans=0.125 2023-06-24 08:36:42,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1062318.0, ans=0.125 2023-06-24 08:36:42,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1062318.0, ans=0.125 2023-06-24 08:36:44,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1062318.0, ans=0.2 2023-06-24 08:37:18,655 INFO [train.py:996] (2/4) Epoch 6, batch 24600, loss[loss=0.2053, simple_loss=0.2707, pruned_loss=0.0699, over 21689.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3058, pruned_loss=0.07859, over 4283472.11 frames. ], batch size: 333, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:37:19,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1062438.0, ans=0.125 2023-06-24 08:38:07,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1062558.0, ans=0.0 2023-06-24 08:39:14,755 INFO [train.py:996] (2/4) Epoch 6, batch 24650, loss[loss=0.2048, simple_loss=0.2719, pruned_loss=0.06885, over 21349.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.2991, pruned_loss=0.077, over 4283973.09 frames. 
], batch size: 131, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:39:15,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1062738.0, ans=0.125 2023-06-24 08:39:20,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1062738.0, ans=0.2 2023-06-24 08:39:24,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1062738.0, ans=0.1 2023-06-24 08:40:03,036 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.821e+02 2.707e+02 3.176e+02 3.617e+02 5.573e+02, threshold=6.353e+02, percent-clipped=0.0 2023-06-24 08:40:18,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1062918.0, ans=0.1 2023-06-24 08:41:03,583 INFO [train.py:996] (2/4) Epoch 6, batch 24700, loss[loss=0.1864, simple_loss=0.26, pruned_loss=0.05636, over 21558.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2968, pruned_loss=0.07479, over 4275489.38 frames. ], batch size: 263, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:41:25,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1063098.0, ans=0.125 2023-06-24 08:42:52,516 INFO [train.py:996] (2/4) Epoch 6, batch 24750, loss[loss=0.1866, simple_loss=0.2472, pruned_loss=0.06302, over 21189.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2927, pruned_loss=0.07205, over 4255402.61 frames. ], batch size: 176, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:43:03,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1063338.0, ans=0.2 2023-06-24 08:43:07,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1063338.0, ans=0.1 2023-06-24 08:43:27,360 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-24 08:43:35,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1063458.0, ans=0.125 2023-06-24 08:43:35,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1063458.0, ans=0.07 2023-06-24 08:43:41,626 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.438e+02 2.880e+02 3.643e+02 9.109e+02, threshold=5.760e+02, percent-clipped=1.0 2023-06-24 08:44:24,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1063578.0, ans=0.0 2023-06-24 08:44:36,342 INFO [train.py:996] (2/4) Epoch 6, batch 24800, loss[loss=0.2221, simple_loss=0.2922, pruned_loss=0.07595, over 21917.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2873, pruned_loss=0.07257, over 4260402.62 frames. 
], batch size: 333, lr: 4.96e-03, grad_scale: 32.0 2023-06-24 08:45:50,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1063818.0, ans=0.0 2023-06-24 08:46:19,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1063878.0, ans=0.125 2023-06-24 08:46:20,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1063878.0, ans=0.125 2023-06-24 08:46:26,850 INFO [train.py:996] (2/4) Epoch 6, batch 24850, loss[loss=0.1888, simple_loss=0.2574, pruned_loss=0.06006, over 21630.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2883, pruned_loss=0.07441, over 4273606.28 frames. ], batch size: 230, lr: 4.96e-03, grad_scale: 32.0 2023-06-24 08:47:21,026 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.836e+02 3.370e+02 3.940e+02 7.201e+02, threshold=6.739e+02, percent-clipped=1.0 2023-06-24 08:48:21,636 INFO [train.py:996] (2/4) Epoch 6, batch 24900, loss[loss=0.2433, simple_loss=0.3176, pruned_loss=0.08446, over 21748.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2907, pruned_loss=0.07537, over 4277422.75 frames. ], batch size: 332, lr: 4.96e-03, grad_scale: 32.0 2023-06-24 08:48:28,383 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=15.0 2023-06-24 08:48:42,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1064298.0, ans=0.1 2023-06-24 08:50:06,572 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.77 vs. limit=10.0 2023-06-24 08:50:14,173 INFO [train.py:996] (2/4) Epoch 6, batch 24950, loss[loss=0.2616, simple_loss=0.3459, pruned_loss=0.08868, over 21819.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.2982, pruned_loss=0.07907, over 4276414.45 frames. ], batch size: 118, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:50:26,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1064538.0, ans=0.1 2023-06-24 08:51:12,089 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.415e+02 2.867e+02 3.295e+02 3.992e+02 6.156e+02, threshold=6.590e+02, percent-clipped=0.0 2023-06-24 08:51:40,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1064718.0, ans=0.2 2023-06-24 08:51:41,464 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.78 vs. limit=15.0 2023-06-24 08:51:42,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1064718.0, ans=0.1 2023-06-24 08:52:06,482 INFO [train.py:996] (2/4) Epoch 6, batch 25000, loss[loss=0.2204, simple_loss=0.3062, pruned_loss=0.06731, over 20725.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3038, pruned_loss=0.08009, over 4282471.42 frames. 
], batch size: 607, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:53:20,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1065018.0, ans=0.0 2023-06-24 08:53:32,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1065018.0, ans=0.07 2023-06-24 08:53:53,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1065138.0, ans=0.05 2023-06-24 08:53:54,474 INFO [train.py:996] (2/4) Epoch 6, batch 25050, loss[loss=0.2054, simple_loss=0.2746, pruned_loss=0.06811, over 22017.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.2964, pruned_loss=0.07826, over 4278720.44 frames. ], batch size: 103, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:54:12,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1065138.0, ans=0.2 2023-06-24 08:54:35,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1065198.0, ans=0.0 2023-06-24 08:54:35,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1065198.0, ans=0.125 2023-06-24 08:54:56,059 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.544e+02 2.890e+02 3.638e+02 5.399e+02, threshold=5.780e+02, percent-clipped=0.0 2023-06-24 08:55:44,182 INFO [train.py:996] (2/4) Epoch 6, batch 25100, loss[loss=0.2243, simple_loss=0.3158, pruned_loss=0.06634, over 21676.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2916, pruned_loss=0.07761, over 4269433.49 frames. ], batch size: 332, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:55:44,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1065438.0, ans=0.0 2023-06-24 08:56:22,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1065498.0, ans=0.125 2023-06-24 08:56:22,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1065498.0, ans=0.125 2023-06-24 08:57:31,190 INFO [train.py:996] (2/4) Epoch 6, batch 25150, loss[loss=0.2091, simple_loss=0.2997, pruned_loss=0.05927, over 21826.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2946, pruned_loss=0.07537, over 4264764.33 frames. ], batch size: 351, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 08:58:27,072 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.838e+02 2.368e+02 2.837e+02 3.510e+02 8.139e+02, threshold=5.674e+02, percent-clipped=4.0 2023-06-24 08:58:42,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1065918.0, ans=0.0 2023-06-24 08:59:20,750 INFO [train.py:996] (2/4) Epoch 6, batch 25200, loss[loss=0.1916, simple_loss=0.2763, pruned_loss=0.05342, over 21386.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2949, pruned_loss=0.07403, over 4268126.61 frames. 
], batch size: 131, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 08:59:35,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1066038.0, ans=10.0 2023-06-24 09:00:02,059 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.26 vs. limit=22.5 2023-06-24 09:00:37,611 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=22.5 2023-06-24 09:01:08,295 INFO [train.py:996] (2/4) Epoch 6, batch 25250, loss[loss=0.1937, simple_loss=0.2574, pruned_loss=0.06503, over 21227.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2919, pruned_loss=0.07198, over 4260123.79 frames. ], batch size: 548, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 09:01:12,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1066338.0, ans=0.0 2023-06-24 09:01:48,381 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-24 09:02:12,213 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.424e+02 2.718e+02 3.085e+02 4.421e+02, threshold=5.437e+02, percent-clipped=0.0 2023-06-24 09:02:57,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1066638.0, ans=0.07 2023-06-24 09:02:58,791 INFO [train.py:996] (2/4) Epoch 6, batch 25300, loss[loss=0.2233, simple_loss=0.3, pruned_loss=0.07329, over 21723.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2897, pruned_loss=0.07181, over 4249015.67 frames. ], batch size: 298, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 09:03:01,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1066638.0, ans=0.125 2023-06-24 09:03:18,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1066638.0, ans=0.125 2023-06-24 09:03:48,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1066758.0, ans=0.0 2023-06-24 09:04:05,103 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=22.5 2023-06-24 09:04:06,896 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.99 vs. limit=15.0 2023-06-24 09:04:35,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1066878.0, ans=0.0 2023-06-24 09:04:48,222 INFO [train.py:996] (2/4) Epoch 6, batch 25350, loss[loss=0.1777, simple_loss=0.2643, pruned_loss=0.04557, over 21751.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2925, pruned_loss=0.07179, over 4239294.53 frames. 
], batch size: 282, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 09:05:13,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1066998.0, ans=0.2 2023-06-24 09:05:50,931 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.518e+02 2.873e+02 3.506e+02 6.244e+02, threshold=5.746e+02, percent-clipped=2.0 2023-06-24 09:06:02,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1067118.0, ans=0.125 2023-06-24 09:06:35,407 INFO [train.py:996] (2/4) Epoch 6, batch 25400, loss[loss=0.2246, simple_loss=0.2975, pruned_loss=0.0759, over 21745.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2882, pruned_loss=0.07062, over 4244811.95 frames. ], batch size: 112, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 09:07:19,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1067358.0, ans=0.0 2023-06-24 09:07:46,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1067418.0, ans=0.125 2023-06-24 09:08:25,245 INFO [train.py:996] (2/4) Epoch 6, batch 25450, loss[loss=0.1915, simple_loss=0.2938, pruned_loss=0.04462, over 21791.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2893, pruned_loss=0.07151, over 4243739.90 frames. ], batch size: 282, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 09:08:25,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1067538.0, ans=0.1 2023-06-24 09:08:39,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1067538.0, ans=0.1 2023-06-24 09:09:27,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1067658.0, ans=0.0 2023-06-24 09:09:30,507 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.916e+02 2.400e+02 2.613e+02 3.023e+02 4.754e+02, threshold=5.227e+02, percent-clipped=0.0 2023-06-24 09:10:02,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1067778.0, ans=0.1 2023-06-24 09:10:09,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1067778.0, ans=0.125 2023-06-24 09:10:18,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1067778.0, ans=0.125 2023-06-24 09:10:23,347 INFO [train.py:996] (2/4) Epoch 6, batch 25500, loss[loss=0.2096, simple_loss=0.2987, pruned_loss=0.06026, over 21730.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2895, pruned_loss=0.06913, over 4243254.07 frames. 
], batch size: 332, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 09:10:24,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1067838.0, ans=0.2 2023-06-24 09:10:48,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1067898.0, ans=0.1 2023-06-24 09:11:14,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1067958.0, ans=0.0 2023-06-24 09:11:42,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1068018.0, ans=0.2 2023-06-24 09:11:57,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1068078.0, ans=0.125 2023-06-24 09:11:57,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1068078.0, ans=0.0 2023-06-24 09:12:14,537 INFO [train.py:996] (2/4) Epoch 6, batch 25550, loss[loss=0.2036, simple_loss=0.3203, pruned_loss=0.04348, over 21261.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.295, pruned_loss=0.06939, over 4240670.79 frames. ], batch size: 548, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 09:12:20,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1068138.0, ans=0.125 2023-06-24 09:12:35,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1068138.0, ans=0.1 2023-06-24 09:13:20,097 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.858e+02 2.387e+02 2.706e+02 3.316e+02 5.632e+02, threshold=5.413e+02, percent-clipped=1.0 2023-06-24 09:14:05,893 INFO [train.py:996] (2/4) Epoch 6, batch 25600, loss[loss=0.2834, simple_loss=0.402, pruned_loss=0.08246, over 19752.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.3017, pruned_loss=0.07019, over 4247353.39 frames. ], batch size: 702, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 09:14:06,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1068438.0, ans=0.0 2023-06-24 09:14:23,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1068438.0, ans=0.0 2023-06-24 09:14:56,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1068558.0, ans=0.125 2023-06-24 09:14:56,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1068558.0, ans=0.0 2023-06-24 09:15:05,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1068558.0, ans=0.0 2023-06-24 09:15:13,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1068618.0, ans=0.125 2023-06-24 09:15:17,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1068618.0, ans=0.125 2023-06-24 09:16:00,295 INFO [train.py:996] (2/4) Epoch 6, batch 25650, loss[loss=0.2179, simple_loss=0.2892, pruned_loss=0.07326, over 21773.00 frames. 
], tot_loss[loss=0.2245, simple_loss=0.3031, pruned_loss=0.07293, over 4250677.40 frames. ], batch size: 124, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 09:16:02,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1068738.0, ans=0.035 2023-06-24 09:16:37,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1068858.0, ans=0.0 2023-06-24 09:16:53,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1068858.0, ans=0.125 2023-06-24 09:16:56,694 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.182e+02 2.676e+02 3.048e+02 3.761e+02 7.606e+02, threshold=6.096e+02, percent-clipped=4.0 2023-06-24 09:17:04,794 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.27 vs. limit=12.0 2023-06-24 09:17:33,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1068978.0, ans=0.0 2023-06-24 09:17:41,418 INFO [train.py:996] (2/4) Epoch 6, batch 25700, loss[loss=0.2502, simple_loss=0.3226, pruned_loss=0.08891, over 21528.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2991, pruned_loss=0.07396, over 4251378.62 frames. ], batch size: 471, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 09:18:13,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1069098.0, ans=0.125 2023-06-24 09:18:15,531 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=15.0 2023-06-24 09:18:46,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1069158.0, ans=0.125 2023-06-24 09:18:46,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1069158.0, ans=0.125 2023-06-24 09:18:48,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1069158.0, ans=0.2 2023-06-24 09:18:57,434 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=22.5 2023-06-24 09:19:39,971 INFO [train.py:996] (2/4) Epoch 6, batch 25750, loss[loss=0.2745, simple_loss=0.3586, pruned_loss=0.09515, over 21702.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3028, pruned_loss=0.07634, over 4249206.45 frames. 
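
The scaling.py:182 messages record the current value of hyperparameters (dropout probabilities, skip rates, balancer probs and limits) that are scheduled against batch_count. A minimal sketch of such a piecewise-linear schedule is shown below, assuming sorted (batch_count, value) breakpoints with linear interpolation and flat extrapolation; the real ScheduledFloat class has more machinery, and scheduled_float is an illustrative helper, not its API.

    import bisect

    def scheduled_float(batch_count, points):
        """points: sorted (batch_count, value) breakpoints, e.g. [(0.0, 0.3), (20000.0, 0.1)]."""
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        if batch_count <= xs[0]:
            return ys[0]
        if batch_count >= xs[-1]:
            return ys[-1]
        i = bisect.bisect_right(xs, batch_count)
        x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)
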
], batch size: 351, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 09:19:54,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1069338.0, ans=0.025 2023-06-24 09:20:18,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1069398.0, ans=0.05 2023-06-24 09:20:20,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1069398.0, ans=0.09899494936611666 2023-06-24 09:20:43,891 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.640e+02 3.088e+02 3.573e+02 6.081e+02, threshold=6.175e+02, percent-clipped=0.0 2023-06-24 09:20:55,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1069518.0, ans=0.125 2023-06-24 09:21:26,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1069578.0, ans=0.0 2023-06-24 09:21:39,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1069578.0, ans=0.1 2023-06-24 09:21:39,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1069578.0, ans=0.125 2023-06-24 09:21:42,026 INFO [train.py:996] (2/4) Epoch 6, batch 25800, loss[loss=0.2242, simple_loss=0.2928, pruned_loss=0.07774, over 20744.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3131, pruned_loss=0.08072, over 4254127.87 frames. ], batch size: 607, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 09:22:45,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1069818.0, ans=0.125 2023-06-24 09:22:49,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1069818.0, ans=0.1 2023-06-24 09:22:57,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1069818.0, ans=0.1 2023-06-24 09:23:04,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1069818.0, ans=0.125 2023-06-24 09:23:30,441 INFO [train.py:996] (2/4) Epoch 6, batch 25850, loss[loss=0.2143, simple_loss=0.2873, pruned_loss=0.07065, over 21426.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3156, pruned_loss=0.08039, over 4260263.84 frames. ], batch size: 211, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:23:54,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1069998.0, ans=0.1 2023-06-24 09:24:29,139 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.685e+02 2.967e+02 3.484e+02 6.005e+02, threshold=5.935e+02, percent-clipped=0.0 2023-06-24 09:25:09,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1070178.0, ans=0.125 2023-06-24 09:25:21,123 INFO [train.py:996] (2/4) Epoch 6, batch 25900, loss[loss=0.3228, simple_loss=0.4149, pruned_loss=0.1153, over 21305.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3194, pruned_loss=0.08184, over 4261615.12 frames. 
], batch size: 548, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:26:17,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1070358.0, ans=10.0 2023-06-24 09:26:26,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1070358.0, ans=0.125 2023-06-24 09:26:52,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1070478.0, ans=0.2 2023-06-24 09:27:16,102 INFO [train.py:996] (2/4) Epoch 6, batch 25950, loss[loss=0.241, simple_loss=0.3266, pruned_loss=0.07767, over 21757.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3236, pruned_loss=0.08342, over 4270300.45 frames. ], batch size: 332, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:27:28,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=1070538.0, ans=22.5 2023-06-24 09:28:14,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1070658.0, ans=0.0 2023-06-24 09:28:14,782 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.61 vs. limit=15.0 2023-06-24 09:28:20,266 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=22.5 2023-06-24 09:28:20,616 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.030e+02 2.613e+02 2.969e+02 3.394e+02 6.568e+02, threshold=5.938e+02, percent-clipped=2.0 2023-06-24 09:29:06,464 INFO [train.py:996] (2/4) Epoch 6, batch 26000, loss[loss=0.2643, simple_loss=0.3513, pruned_loss=0.08867, over 21486.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3235, pruned_loss=0.08152, over 4268240.57 frames. ], batch size: 131, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:29:32,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1070898.0, ans=0.125 2023-06-24 09:29:32,882 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.10 vs. limit=22.5 2023-06-24 09:30:16,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1071018.0, ans=0.1 2023-06-24 09:30:54,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1071138.0, ans=0.125 2023-06-24 09:31:00,868 INFO [train.py:996] (2/4) Epoch 6, batch 26050, loss[loss=0.2194, simple_loss=0.2819, pruned_loss=0.0784, over 21712.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3235, pruned_loss=0.08247, over 4273589.80 frames. 
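
The scaling.py:962 Whitening messages compare a per-module statistic against a limit (metric=6.61 vs. limit=15.0 and similar above); the metric grows when the activation covariance within a group drifts away from a multiple of the identity. One plausible way to compute such a measure is sketched below; this is an assumption about the general idea, not the exact formula in scaling.py.

    import torch

    def whitening_metric(x):
        """x: (num_frames, num_channels) activations for one whitening group.

        Returns 1.0 when the channel covariance is a multiple of the identity
        and grows toward num_channels as energy concentrates in few directions.
        """
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.t() @ x) / x.shape[0]
        eigs = torch.linalg.eigvalsh(cov)           # real, non-negative eigenvalues
        c = eigs.numel()
        return c * (eigs ** 2).sum() / (eigs.sum() ** 2 + 1e-20)
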
], batch size: 230, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:31:04,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1071138.0, ans=0.125 2023-06-24 09:31:26,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1071198.0, ans=0.1 2023-06-24 09:31:54,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1071258.0, ans=0.125 2023-06-24 09:31:58,911 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.920e+02 2.591e+02 3.026e+02 3.549e+02 5.342e+02, threshold=6.052e+02, percent-clipped=0.0 2023-06-24 09:32:15,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1071318.0, ans=0.125 2023-06-24 09:32:37,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1071378.0, ans=0.0 2023-06-24 09:32:40,151 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0 2023-06-24 09:32:47,963 INFO [train.py:996] (2/4) Epoch 6, batch 26100, loss[loss=0.2397, simple_loss=0.3038, pruned_loss=0.08778, over 21913.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3168, pruned_loss=0.0825, over 4277688.63 frames. ], batch size: 414, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:33:00,018 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-24 09:33:04,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1071498.0, ans=0.125 2023-06-24 09:33:08,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1071498.0, ans=0.125 2023-06-24 09:33:29,952 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.53 vs. limit=15.0 2023-06-24 09:33:32,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1071558.0, ans=0.04949747468305833 2023-06-24 09:33:32,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1071558.0, ans=0.09899494936611666 2023-06-24 09:34:20,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1071678.0, ans=0.05 2023-06-24 09:34:35,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1071678.0, ans=0.1 2023-06-24 09:34:37,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1071738.0, ans=0.07 2023-06-24 09:34:38,499 INFO [train.py:996] (2/4) Epoch 6, batch 26150, loss[loss=0.3087, simple_loss=0.3575, pruned_loss=0.13, over 21528.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3143, pruned_loss=0.08297, over 4283835.99 frames. 
], batch size: 510, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:34:55,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1071798.0, ans=0.1 2023-06-24 09:35:02,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1071798.0, ans=0.0 2023-06-24 09:35:16,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1071798.0, ans=0.0 2023-06-24 09:35:39,632 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.601e+02 2.864e+02 3.408e+02 4.627e+02, threshold=5.727e+02, percent-clipped=0.0 2023-06-24 09:36:28,875 INFO [train.py:996] (2/4) Epoch 6, batch 26200, loss[loss=0.2466, simple_loss=0.3587, pruned_loss=0.06727, over 21219.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3151, pruned_loss=0.0809, over 4281815.78 frames. ], batch size: 548, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:38:00,041 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.88 vs. limit=15.0 2023-06-24 09:38:07,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1072278.0, ans=0.0 2023-06-24 09:38:22,427 INFO [train.py:996] (2/4) Epoch 6, batch 26250, loss[loss=0.2225, simple_loss=0.2976, pruned_loss=0.0737, over 21684.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3163, pruned_loss=0.07915, over 4278868.70 frames. ], batch size: 263, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:39:20,860 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.139e+02 2.533e+02 2.809e+02 3.331e+02 4.740e+02, threshold=5.619e+02, percent-clipped=0.0 2023-06-24 09:40:01,091 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-24 09:40:16,013 INFO [train.py:996] (2/4) Epoch 6, batch 26300, loss[loss=0.2314, simple_loss=0.3071, pruned_loss=0.07783, over 21529.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3132, pruned_loss=0.07956, over 4285810.60 frames. ], batch size: 131, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:40:22,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1072638.0, ans=0.0 2023-06-24 09:40:23,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1072638.0, ans=0.0 2023-06-24 09:41:57,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1072878.0, ans=0.125 2023-06-24 09:42:05,745 INFO [train.py:996] (2/4) Epoch 6, batch 26350, loss[loss=0.2164, simple_loss=0.2739, pruned_loss=0.07951, over 21193.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.312, pruned_loss=0.08012, over 4293580.10 frames. ], batch size: 608, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:42:23,510 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:42:26,027 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.55 vs. 
limit=15.0 2023-06-24 09:42:57,921 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.51 vs. limit=6.0 2023-06-24 09:42:58,167 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.852e+02 3.248e+02 3.843e+02 6.054e+02, threshold=6.496e+02, percent-clipped=2.0 2023-06-24 09:43:53,731 INFO [train.py:996] (2/4) Epoch 6, batch 26400, loss[loss=0.2353, simple_loss=0.2934, pruned_loss=0.08858, over 21241.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3057, pruned_loss=0.07973, over 4297269.33 frames. ], batch size: 143, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:43:57,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1073238.0, ans=10.0 2023-06-24 09:43:58,647 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=15.0 2023-06-24 09:43:58,713 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.22 vs. limit=15.0 2023-06-24 09:44:31,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1073298.0, ans=0.125 2023-06-24 09:44:38,097 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.39 vs. limit=22.5 2023-06-24 09:45:04,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1073418.0, ans=0.125 2023-06-24 09:45:06,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1073418.0, ans=0.0 2023-06-24 09:45:27,903 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:45:41,016 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:45:50,255 INFO [train.py:996] (2/4) Epoch 6, batch 26450, loss[loss=0.2217, simple_loss=0.2894, pruned_loss=0.07703, over 21300.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3048, pruned_loss=0.07954, over 4290332.10 frames. ], batch size: 144, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:46:14,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1073598.0, ans=0.2 2023-06-24 09:46:18,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1073598.0, ans=0.1 2023-06-24 09:46:50,152 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 2.812e+02 3.126e+02 4.062e+02 8.206e+02, threshold=6.252e+02, percent-clipped=4.0 2023-06-24 09:47:09,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.whiten.whitening_limit, batch_count=1073718.0, ans=12.0 2023-06-24 09:47:39,831 INFO [train.py:996] (2/4) Epoch 6, batch 26500, loss[loss=0.1672, simple_loss=0.2313, pruned_loss=0.05158, over 21348.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.306, pruned_loss=0.07814, over 4284688.65 frames. 
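
The grad_scale printed with each batch comes from fp16 training with a dynamic loss scaler: the scale is halved when scaled gradients overflow and grown back after a run of clean steps, which is why it moves between 32, 16 and 8 across these records. A generic torch.cuda.amp pattern that produces such a value is sketched below; train_step and the model(batch) interface are illustrative, not the actual training loop.

    import torch

    scaler = torch.cuda.amp.GradScaler()             # what fp16 training enables

    def train_step(model, optimizer, batch):         # illustrative loop, not train.py
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = model(batch)                      # forward in mixed precision
        scaler.scale(loss).backward()                # scale up to avoid fp16 underflow
        scaler.step(optimizer)                       # unscales grads; skips step on inf/nan
        scaler.update()                              # halves scale on overflow, grows it later
        return loss.detach(), scaler.get_scale()     # second value is what the log calls grad_scale
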
], batch size: 131, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:48:25,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1073958.0, ans=0.04949747468305833 2023-06-24 09:49:03,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1074018.0, ans=0.0 2023-06-24 09:49:17,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1074078.0, ans=0.125 2023-06-24 09:49:31,897 INFO [train.py:996] (2/4) Epoch 6, batch 26550, loss[loss=0.2155, simple_loss=0.308, pruned_loss=0.06155, over 21719.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3058, pruned_loss=0.07547, over 4269261.60 frames. ], batch size: 351, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:50:42,538 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.610e+02 3.106e+02 3.674e+02 5.828e+02, threshold=6.212e+02, percent-clipped=0.0 2023-06-24 09:50:48,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1074318.0, ans=0.125 2023-06-24 09:51:26,545 INFO [train.py:996] (2/4) Epoch 6, batch 26600, loss[loss=0.1974, simple_loss=0.2725, pruned_loss=0.06113, over 21576.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3054, pruned_loss=0.07285, over 4273842.63 frames. ], batch size: 263, lr: 4.93e-03, grad_scale: 32.0 2023-06-24 09:51:59,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1074498.0, ans=0.125 2023-06-24 09:52:13,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1074558.0, ans=0.2 2023-06-24 09:52:55,299 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=15.0 2023-06-24 09:53:09,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1074678.0, ans=0.1 2023-06-24 09:53:15,489 INFO [train.py:996] (2/4) Epoch 6, batch 26650, loss[loss=0.1851, simple_loss=0.2561, pruned_loss=0.05705, over 21553.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2996, pruned_loss=0.07182, over 4263615.24 frames. ], batch size: 263, lr: 4.93e-03, grad_scale: 32.0 2023-06-24 09:53:36,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1074738.0, ans=0.2 2023-06-24 09:54:12,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1074858.0, ans=0.0 2023-06-24 09:54:18,563 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.725e+02 2.261e+02 2.468e+02 2.751e+02 5.054e+02, threshold=4.936e+02, percent-clipped=0.0 2023-06-24 09:55:03,188 INFO [train.py:996] (2/4) Epoch 6, batch 26700, loss[loss=0.1589, simple_loss=0.2371, pruned_loss=0.04036, over 21678.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2924, pruned_loss=0.06936, over 4267386.88 frames. 
], batch size: 263, lr: 4.93e-03, grad_scale: 32.0 2023-06-24 09:56:27,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1075218.0, ans=0.0 2023-06-24 09:56:53,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1075278.0, ans=0.125 2023-06-24 09:56:59,441 INFO [train.py:996] (2/4) Epoch 6, batch 26750, loss[loss=0.2866, simple_loss=0.3639, pruned_loss=0.1047, over 21452.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2934, pruned_loss=0.06871, over 4274517.04 frames. ], batch size: 131, lr: 4.93e-03, grad_scale: 16.0 2023-06-24 09:57:31,139 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:57:36,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1075398.0, ans=0.125 2023-06-24 09:57:55,236 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.775e+02 2.365e+02 2.700e+02 3.222e+02 4.591e+02, threshold=5.400e+02, percent-clipped=0.0 2023-06-24 09:58:49,567 INFO [train.py:996] (2/4) Epoch 6, batch 26800, loss[loss=0.2436, simple_loss=0.3182, pruned_loss=0.08451, over 21478.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3003, pruned_loss=0.07188, over 4272918.37 frames. ], batch size: 194, lr: 4.93e-03, grad_scale: 32.0 2023-06-24 09:59:05,343 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.05 vs. limit=12.0 2023-06-24 09:59:50,737 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.37 vs. limit=15.0 2023-06-24 10:00:23,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1075878.0, ans=0.125 2023-06-24 10:00:32,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1075878.0, ans=0.2 2023-06-24 10:00:43,907 INFO [train.py:996] (2/4) Epoch 6, batch 26850, loss[loss=0.2149, simple_loss=0.2889, pruned_loss=0.07047, over 21462.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3015, pruned_loss=0.0747, over 4279064.55 frames. 
], batch size: 131, lr: 4.93e-03, grad_scale: 16.0 2023-06-24 10:00:53,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1075938.0, ans=0.2 2023-06-24 10:01:11,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1075998.0, ans=0.025 2023-06-24 10:01:25,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1076058.0, ans=0.125 2023-06-24 10:01:46,051 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.916e+02 2.745e+02 3.127e+02 3.693e+02 5.292e+02, threshold=6.255e+02, percent-clipped=0.0 2023-06-24 10:01:46,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1076118.0, ans=10.0 2023-06-24 10:02:22,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1076178.0, ans=0.125 2023-06-24 10:02:25,586 INFO [train.py:996] (2/4) Epoch 6, batch 26900, loss[loss=0.1966, simple_loss=0.2569, pruned_loss=0.06819, over 21363.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2937, pruned_loss=0.07367, over 4283724.00 frames. ], batch size: 211, lr: 4.93e-03, grad_scale: 16.0 2023-06-24 10:03:06,799 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.31 vs. limit=6.0 2023-06-24 10:03:29,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1076418.0, ans=0.125 2023-06-24 10:03:31,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1076418.0, ans=0.125 2023-06-24 10:04:14,914 INFO [train.py:996] (2/4) Epoch 6, batch 26950, loss[loss=0.2389, simple_loss=0.3287, pruned_loss=0.07452, over 21690.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2935, pruned_loss=0.07457, over 4279080.18 frames. ], batch size: 332, lr: 4.93e-03, grad_scale: 8.0 2023-06-24 10:04:24,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1076538.0, ans=0.0 2023-06-24 10:04:33,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1076538.0, ans=0.125 2023-06-24 10:04:38,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1076598.0, ans=0.125 2023-06-24 10:04:55,082 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.97 vs. 
limit=15.0 2023-06-24 10:05:10,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1076658.0, ans=0.125 2023-06-24 10:05:23,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1076718.0, ans=0.2 2023-06-24 10:05:25,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1076718.0, ans=0.04949747468305833 2023-06-24 10:05:26,440 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.499e+02 2.950e+02 4.079e+02 6.623e+02, threshold=5.900e+02, percent-clipped=3.0 2023-06-24 10:05:44,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1076718.0, ans=0.125 2023-06-24 10:06:10,651 INFO [train.py:996] (2/4) Epoch 6, batch 27000, loss[loss=0.2507, simple_loss=0.3399, pruned_loss=0.08075, over 21546.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2944, pruned_loss=0.07278, over 4271030.22 frames. ], batch size: 441, lr: 4.93e-03, grad_scale: 8.0 2023-06-24 10:06:10,651 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 10:06:28,776 INFO [train.py:1028] (2/4) Epoch 6, validation: loss=0.2519, simple_loss=0.3439, pruned_loss=0.0799, over 1796401.00 frames. 2023-06-24 10:06:28,778 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-24 10:07:29,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1076958.0, ans=0.125 2023-06-24 10:08:07,213 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.53 vs. limit=15.0 2023-06-24 10:08:08,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1077078.0, ans=0.0 2023-06-24 10:08:18,391 INFO [train.py:996] (2/4) Epoch 6, batch 27050, loss[loss=0.205, simple_loss=0.317, pruned_loss=0.04649, over 20725.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2958, pruned_loss=0.06944, over 4267560.70 frames. ], batch size: 607, lr: 4.93e-03, grad_scale: 8.0 2023-06-24 10:08:29,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1077138.0, ans=0.0 2023-06-24 10:08:37,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1077138.0, ans=0.125 2023-06-24 10:09:02,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1077198.0, ans=0.125 2023-06-24 10:09:34,310 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.742e+02 2.401e+02 2.781e+02 3.239e+02 4.464e+02, threshold=5.563e+02, percent-clipped=0.0 2023-06-24 10:10:08,128 INFO [train.py:996] (2/4) Epoch 6, batch 27100, loss[loss=0.2098, simple_loss=0.2893, pruned_loss=0.06517, over 16983.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2977, pruned_loss=0.07063, over 4270979.46 frames. 
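
At regular intervals the log switches to "Computing validation loss", reports it over the fixed 1796401-frame dev set (loss=0.2519 above) and prints the CUDA memory high-water mark (23731MB). A minimal sketch of that measurement using generic PyTorch calls is below; the per-batch interface model(batch) -> (loss, num_frames) is an assumption made for illustration.

    import torch

    @torch.no_grad()
    def compute_validation_loss(model, dev_loader, device):
        model.eval()
        loss_sum, frame_sum = 0.0, 0.0
        for batch in dev_loader:
            loss, num_frames = model(batch)          # assumed interface, for illustration
            loss_sum += loss.item() * num_frames
            frame_sum += num_frames
        model.train()
        max_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
        return loss_sum / frame_sum, max_mb          # e.g. 0.2519 over 1796401 frames, 23731MB
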
], batch size: 60, lr: 4.93e-03, grad_scale: 8.0 2023-06-24 10:10:25,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1077438.0, ans=0.125 2023-06-24 10:10:30,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1077498.0, ans=0.1 2023-06-24 10:10:41,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1077498.0, ans=0.0 2023-06-24 10:10:41,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1077498.0, ans=0.0 2023-06-24 10:11:33,253 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.36 vs. limit=15.0 2023-06-24 10:11:52,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1077678.0, ans=0.125 2023-06-24 10:11:58,246 INFO [train.py:996] (2/4) Epoch 6, batch 27150, loss[loss=0.2484, simple_loss=0.3225, pruned_loss=0.08714, over 21281.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3086, pruned_loss=0.07409, over 4271736.32 frames. ], batch size: 159, lr: 4.93e-03, grad_scale: 8.0 2023-06-24 10:13:13,433 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.603e+02 2.899e+02 3.318e+02 5.343e+02, threshold=5.797e+02, percent-clipped=0.0 2023-06-24 10:13:14,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1077918.0, ans=0.0 2023-06-24 10:13:52,987 INFO [train.py:996] (2/4) Epoch 6, batch 27200, loss[loss=0.2236, simple_loss=0.3095, pruned_loss=0.06885, over 21447.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3161, pruned_loss=0.07704, over 4268785.10 frames. ], batch size: 211, lr: 4.93e-03, grad_scale: 16.0 2023-06-24 10:14:23,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1078098.0, ans=0.0 2023-06-24 10:14:29,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1078098.0, ans=0.2 2023-06-24 10:14:39,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1078158.0, ans=0.1 2023-06-24 10:14:46,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1078158.0, ans=0.1 2023-06-24 10:15:28,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1078278.0, ans=0.0 2023-06-24 10:15:48,396 INFO [train.py:996] (2/4) Epoch 6, batch 27250, loss[loss=0.2164, simple_loss=0.2991, pruned_loss=0.06685, over 22006.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3182, pruned_loss=0.08059, over 4264271.98 frames. 
], batch size: 317, lr: 4.93e-03, grad_scale: 16.0 2023-06-24 10:16:27,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1078398.0, ans=0.125 2023-06-24 10:16:56,095 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.396e+02 2.982e+02 3.326e+02 3.737e+02 5.172e+02, threshold=6.652e+02, percent-clipped=0.0 2023-06-24 10:17:22,706 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=12.0 2023-06-24 10:17:29,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1078578.0, ans=0.025 2023-06-24 10:17:36,590 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-06-24 10:17:45,563 INFO [train.py:996] (2/4) Epoch 6, batch 27300, loss[loss=0.2854, simple_loss=0.3603, pruned_loss=0.1053, over 21785.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3205, pruned_loss=0.08228, over 4265255.62 frames. ], batch size: 124, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:18:36,301 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0 2023-06-24 10:19:07,396 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-24 10:19:33,572 INFO [train.py:996] (2/4) Epoch 6, batch 27350, loss[loss=0.2401, simple_loss=0.3247, pruned_loss=0.07776, over 21412.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3231, pruned_loss=0.08258, over 4271111.11 frames. ], batch size: 211, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:19:34,720 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=12.0 2023-06-24 10:19:55,388 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=15.0 2023-06-24 10:20:05,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1078998.0, ans=0.125 2023-06-24 10:20:10,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1079058.0, ans=0.125 2023-06-24 10:20:37,093 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.617e+02 2.947e+02 3.408e+02 6.075e+02, threshold=5.893e+02, percent-clipped=0.0 2023-06-24 10:21:19,484 INFO [train.py:996] (2/4) Epoch 6, batch 27400, loss[loss=0.2271, simple_loss=0.2943, pruned_loss=0.07993, over 21846.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3186, pruned_loss=0.08217, over 4265990.08 frames. 
], batch size: 118, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:22:09,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1079358.0, ans=0.0 2023-06-24 10:22:11,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1079358.0, ans=0.1 2023-06-24 10:22:31,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1079418.0, ans=0.125 2023-06-24 10:22:59,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1079478.0, ans=0.125 2023-06-24 10:23:00,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1079478.0, ans=0.0 2023-06-24 10:23:07,175 INFO [train.py:996] (2/4) Epoch 6, batch 27450, loss[loss=0.2204, simple_loss=0.3056, pruned_loss=0.0676, over 21639.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3118, pruned_loss=0.08005, over 4260749.66 frames. ], batch size: 231, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:24:07,321 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.980e+02 2.466e+02 2.775e+02 3.164e+02 4.697e+02, threshold=5.550e+02, percent-clipped=0.0 2023-06-24 10:24:08,815 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-06-24 10:24:16,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1079718.0, ans=0.0 2023-06-24 10:24:50,397 INFO [train.py:996] (2/4) Epoch 6, batch 27500, loss[loss=0.2451, simple_loss=0.3138, pruned_loss=0.08817, over 21247.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3093, pruned_loss=0.08029, over 4266651.35 frames. ], batch size: 143, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:25:03,875 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=12.0 2023-06-24 10:25:05,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1079838.0, ans=0.0 2023-06-24 10:25:18,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1079898.0, ans=0.2 2023-06-24 10:26:23,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1080078.0, ans=0.025 2023-06-24 10:26:34,095 INFO [train.py:996] (2/4) Epoch 6, batch 27550, loss[loss=0.2208, simple_loss=0.2876, pruned_loss=0.077, over 21592.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3066, pruned_loss=0.07773, over 4267189.37 frames. 
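
Each training record prints both the current batch's loss over roughly 21k frames and a tot_loss over roughly 4.2M frames; the latter behaves like a slowly decaying, frame-weighted aggregate, since 4.2M frames at about 21k frames per batch corresponds to a window of about 200 recent batches. A toy sketch of that kind of aggregation is below; the exact averaging in train.py may differ, and update_running_loss is an illustrative name.

    def update_running_loss(loss_sum, frame_sum, batch_loss, batch_frames, window=200):
        """Frame-weighted aggregate that decays old batches.

        With ~21k frames per batch a window of ~200 batches keeps roughly the
        4.2M accumulated frames seen in the tot_loss entries.
        """
        decay = 1.0 - 1.0 / window
        frame_sum = decay * frame_sum + batch_frames
        loss_sum = decay * loss_sum + batch_loss * batch_frames
        return loss_sum, frame_sum                   # report loss_sum / frame_sum as tot_loss
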
], batch size: 414, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:27:19,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1080258.0, ans=0.04949747468305833 2023-06-24 10:27:43,824 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.511e+02 2.711e+02 3.223e+02 7.892e+02, threshold=5.422e+02, percent-clipped=3.0 2023-06-24 10:27:54,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1080318.0, ans=0.1 2023-06-24 10:28:21,569 INFO [train.py:996] (2/4) Epoch 6, batch 27600, loss[loss=0.2137, simple_loss=0.2764, pruned_loss=0.07555, over 21302.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3001, pruned_loss=0.07653, over 4260269.28 frames. ], batch size: 471, lr: 4.92e-03, grad_scale: 32.0 2023-06-24 10:28:51,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1080498.0, ans=0.125 2023-06-24 10:30:08,111 INFO [train.py:996] (2/4) Epoch 6, batch 27650, loss[loss=0.2495, simple_loss=0.3272, pruned_loss=0.08593, over 21693.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2939, pruned_loss=0.0754, over 4256377.46 frames. ], batch size: 414, lr: 4.92e-03, grad_scale: 32.0 2023-06-24 10:30:21,215 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=15.0 2023-06-24 10:30:23,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1080798.0, ans=0.125 2023-06-24 10:30:27,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1080798.0, ans=0.0 2023-06-24 10:31:11,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1080918.0, ans=0.125 2023-06-24 10:31:12,208 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.435e+02 2.709e+02 3.081e+02 4.195e+02, threshold=5.419e+02, percent-clipped=0.0 2023-06-24 10:31:35,990 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-24 10:31:51,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1080978.0, ans=0.1 2023-06-24 10:31:56,498 INFO [train.py:996] (2/4) Epoch 6, batch 27700, loss[loss=0.2066, simple_loss=0.2517, pruned_loss=0.08071, over 20205.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2932, pruned_loss=0.0734, over 4260739.32 frames. ], batch size: 703, lr: 4.92e-03, grad_scale: 32.0 2023-06-24 10:32:48,080 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.26 vs. limit=15.0 2023-06-24 10:32:56,601 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.39 vs. limit=15.0 2023-06-24 10:33:14,161 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. 
limit=15.0 2023-06-24 10:33:25,332 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 10:33:38,235 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=15.0 2023-06-24 10:33:45,299 INFO [train.py:996] (2/4) Epoch 6, batch 27750, loss[loss=0.2484, simple_loss=0.3427, pruned_loss=0.07706, over 21237.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2966, pruned_loss=0.07366, over 4259895.19 frames. ], batch size: 548, lr: 4.92e-03, grad_scale: 32.0 2023-06-24 10:33:45,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1081338.0, ans=0.0 2023-06-24 10:34:00,169 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-06-24 10:34:18,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1081398.0, ans=0.0 2023-06-24 10:34:55,020 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 2.574e+02 2.914e+02 3.859e+02 6.202e+02, threshold=5.827e+02, percent-clipped=2.0 2023-06-24 10:35:32,910 INFO [train.py:996] (2/4) Epoch 6, batch 27800, loss[loss=0.1949, simple_loss=0.2674, pruned_loss=0.06121, over 21453.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2964, pruned_loss=0.07408, over 4268348.54 frames. ], batch size: 211, lr: 4.92e-03, grad_scale: 32.0 2023-06-24 10:35:37,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1081638.0, ans=0.1 2023-06-24 10:35:38,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1081638.0, ans=0.125 2023-06-24 10:35:45,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1081638.0, ans=0.125 2023-06-24 10:35:49,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1081698.0, ans=0.125 2023-06-24 10:36:08,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1081698.0, ans=0.0 2023-06-24 10:36:22,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1081758.0, ans=0.0 2023-06-24 10:36:26,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1081758.0, ans=0.125 2023-06-24 10:36:45,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1081818.0, ans=0.0 2023-06-24 10:37:21,727 INFO [train.py:996] (2/4) Epoch 6, batch 27850, loss[loss=0.2375, simple_loss=0.333, pruned_loss=0.07101, over 21757.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2964, pruned_loss=0.07496, over 4281216.66 frames. 
], batch size: 298, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:37:27,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1081938.0, ans=0.2 2023-06-24 10:37:51,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1081998.0, ans=0.125 2023-06-24 10:38:05,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1081998.0, ans=0.0 2023-06-24 10:38:24,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1082058.0, ans=0.1 2023-06-24 10:38:29,084 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.57 vs. limit=15.0 2023-06-24 10:38:39,969 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.601e+02 3.026e+02 3.751e+02 1.054e+03, threshold=6.053e+02, percent-clipped=6.0 2023-06-24 10:39:11,480 INFO [train.py:996] (2/4) Epoch 6, batch 27900, loss[loss=0.2126, simple_loss=0.2785, pruned_loss=0.07329, over 21233.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3027, pruned_loss=0.0752, over 4281922.23 frames. ], batch size: 608, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:39:24,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1082238.0, ans=0.125 2023-06-24 10:41:13,481 INFO [train.py:996] (2/4) Epoch 6, batch 27950, loss[loss=0.2416, simple_loss=0.3291, pruned_loss=0.07707, over 21905.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3034, pruned_loss=0.07211, over 4285779.43 frames. ], batch size: 372, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:42:19,429 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.526e+02 3.218e+02 4.121e+02 6.447e+02, threshold=6.437e+02, percent-clipped=1.0 2023-06-24 10:42:57,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1082778.0, ans=0.0 2023-06-24 10:43:01,525 INFO [train.py:996] (2/4) Epoch 6, batch 28000, loss[loss=0.2306, simple_loss=0.2904, pruned_loss=0.08544, over 19974.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.3019, pruned_loss=0.07018, over 4286025.47 frames. ], batch size: 702, lr: 4.92e-03, grad_scale: 32.0 2023-06-24 10:43:47,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1082958.0, ans=0.125 2023-06-24 10:44:20,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1083018.0, ans=0.0 2023-06-24 10:44:57,539 INFO [train.py:996] (2/4) Epoch 6, batch 28050, loss[loss=0.1989, simple_loss=0.2689, pruned_loss=0.0644, over 21616.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2987, pruned_loss=0.07121, over 4285313.36 frames. 
], batch size: 230, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:45:07,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1083138.0, ans=0.125 2023-06-24 10:45:13,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1083198.0, ans=0.05 2023-06-24 10:45:15,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1083198.0, ans=0.0 2023-06-24 10:45:48,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1083258.0, ans=0.05 2023-06-24 10:46:03,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1083318.0, ans=0.1 2023-06-24 10:46:04,461 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 2.746e+02 3.083e+02 3.764e+02 7.718e+02, threshold=6.165e+02, percent-clipped=1.0 2023-06-24 10:46:14,460 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.07 vs. limit=15.0 2023-06-24 10:46:28,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1083378.0, ans=0.125 2023-06-24 10:46:45,933 INFO [train.py:996] (2/4) Epoch 6, batch 28100, loss[loss=0.2149, simple_loss=0.2747, pruned_loss=0.07752, over 21515.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2988, pruned_loss=0.07172, over 4285142.79 frames. ], batch size: 414, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:47:52,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1083618.0, ans=0.125 2023-06-24 10:48:00,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1083618.0, ans=0.07 2023-06-24 10:48:17,652 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=15.0 2023-06-24 10:48:20,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1083678.0, ans=0.0 2023-06-24 10:48:34,053 INFO [train.py:996] (2/4) Epoch 6, batch 28150, loss[loss=0.2166, simple_loss=0.2815, pruned_loss=0.0758, over 14941.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2936, pruned_loss=0.07164, over 4279778.64 frames. ], batch size: 62, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:49:40,086 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 2.816e+02 3.227e+02 4.008e+02 8.112e+02, threshold=6.453e+02, percent-clipped=1.0 2023-06-24 10:49:42,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1083918.0, ans=0.125 2023-06-24 10:50:24,162 INFO [train.py:996] (2/4) Epoch 6, batch 28200, loss[loss=0.2457, simple_loss=0.3137, pruned_loss=0.08891, over 21703.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2919, pruned_loss=0.0738, over 4284168.46 frames. ], batch size: 351, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:51:22,403 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.14 vs. 
limit=10.0 2023-06-24 10:52:02,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1084278.0, ans=0.0 2023-06-24 10:52:09,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1084278.0, ans=0.2 2023-06-24 10:52:11,998 INFO [train.py:996] (2/4) Epoch 6, batch 28250, loss[loss=0.2211, simple_loss=0.2823, pruned_loss=0.07997, over 21604.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2941, pruned_loss=0.07602, over 4272778.60 frames. ], batch size: 415, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:53:30,925 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 2.671e+02 3.008e+02 3.478e+02 6.433e+02, threshold=6.015e+02, percent-clipped=0.0 2023-06-24 10:53:35,607 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=12.0 2023-06-24 10:53:35,649 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=15.0 2023-06-24 10:53:35,756 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.71 vs. limit=15.0 2023-06-24 10:54:03,546 INFO [train.py:996] (2/4) Epoch 6, batch 28300, loss[loss=0.1664, simple_loss=0.2563, pruned_loss=0.03822, over 21758.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2928, pruned_loss=0.07479, over 4265447.22 frames. ], batch size: 282, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:54:36,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1084698.0, ans=0.0 2023-06-24 10:54:48,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1084758.0, ans=0.1 2023-06-24 10:54:51,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1084758.0, ans=0.2 2023-06-24 10:55:37,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1084878.0, ans=0.1 2023-06-24 10:55:46,834 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.44 vs. limit=22.5 2023-06-24 10:55:57,143 INFO [train.py:996] (2/4) Epoch 6, batch 28350, loss[loss=0.1833, simple_loss=0.2528, pruned_loss=0.05689, over 21543.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2896, pruned_loss=0.07036, over 4256326.96 frames. ], batch size: 230, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:56:01,940 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.15 vs. limit=12.0 2023-06-24 10:56:23,947 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-24 10:56:26,936 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.43 vs. 
limit=15.0 2023-06-24 10:56:46,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1085058.0, ans=0.125 2023-06-24 10:57:10,499 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.831e+02 2.270e+02 2.582e+02 2.935e+02 5.064e+02, threshold=5.164e+02, percent-clipped=0.0 2023-06-24 10:57:14,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1085118.0, ans=0.2 2023-06-24 10:57:46,652 INFO [train.py:996] (2/4) Epoch 6, batch 28400, loss[loss=0.2072, simple_loss=0.2713, pruned_loss=0.07154, over 21464.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2863, pruned_loss=0.06887, over 4255145.55 frames. ], batch size: 441, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:57:54,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1085238.0, ans=0.125 2023-06-24 10:58:11,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1085298.0, ans=0.125 2023-06-24 10:58:21,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1085298.0, ans=0.125 2023-06-24 10:59:03,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1085418.0, ans=0.125 2023-06-24 10:59:26,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1085478.0, ans=0.0 2023-06-24 10:59:36,153 INFO [train.py:996] (2/4) Epoch 6, batch 28450, loss[loss=0.2243, simple_loss=0.2914, pruned_loss=0.07856, over 21642.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2902, pruned_loss=0.07226, over 4251926.67 frames. ], batch size: 263, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:59:47,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1085538.0, ans=0.125 2023-06-24 10:59:47,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1085538.0, ans=0.07 2023-06-24 11:00:03,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1085598.0, ans=0.1 2023-06-24 11:00:34,518 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:00:42,797 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.782e+02 3.154e+02 3.614e+02 5.624e+02, threshold=6.308e+02, percent-clipped=2.0 2023-06-24 11:01:02,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1085778.0, ans=0.1 2023-06-24 11:01:09,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1085778.0, ans=0.1 2023-06-24 11:01:09,674 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.17 vs. limit=22.5 2023-06-24 11:01:25,040 INFO [train.py:996] (2/4) Epoch 6, batch 28500, loss[loss=0.256, simple_loss=0.3194, pruned_loss=0.09627, over 21430.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2946, pruned_loss=0.0757, over 4263742.23 frames. 
], batch size: 194, lr: 4.91e-03, grad_scale: 16.0 2023-06-24 11:01:43,283 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-06-24 11:03:17,205 INFO [train.py:996] (2/4) Epoch 6, batch 28550, loss[loss=0.2809, simple_loss=0.3664, pruned_loss=0.09771, over 21738.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.301, pruned_loss=0.0781, over 4268416.82 frames. ], batch size: 441, lr: 4.91e-03, grad_scale: 16.0 2023-06-24 11:04:37,437 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.822e+02 3.220e+02 3.775e+02 6.822e+02, threshold=6.440e+02, percent-clipped=1.0 2023-06-24 11:04:58,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1086378.0, ans=0.125 2023-06-24 11:05:12,891 INFO [train.py:996] (2/4) Epoch 6, batch 28600, loss[loss=0.2308, simple_loss=0.3095, pruned_loss=0.07603, over 21654.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3078, pruned_loss=0.08065, over 4272496.74 frames. ], batch size: 230, lr: 4.91e-03, grad_scale: 16.0 2023-06-24 11:05:13,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1086438.0, ans=0.125 2023-06-24 11:05:37,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1086498.0, ans=0.125 2023-06-24 11:06:28,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1086618.0, ans=0.2 2023-06-24 11:06:32,129 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.82 vs. limit=6.0 2023-06-24 11:06:36,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1086678.0, ans=0.125 2023-06-24 11:06:42,669 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.14 vs. limit=15.0 2023-06-24 11:06:45,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1086678.0, ans=0.0 2023-06-24 11:07:06,286 INFO [train.py:996] (2/4) Epoch 6, batch 28650, loss[loss=0.2173, simple_loss=0.2746, pruned_loss=0.08002, over 21579.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3031, pruned_loss=0.07998, over 4271774.53 frames. ], batch size: 415, lr: 4.91e-03, grad_scale: 16.0 2023-06-24 11:07:07,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1086738.0, ans=0.0 2023-06-24 11:07:10,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1086738.0, ans=0.125 2023-06-24 11:07:18,224 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.55 vs. 
limit=15.0 2023-06-24 11:08:12,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1086918.0, ans=0.125 2023-06-24 11:08:14,719 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.097e+02 2.693e+02 2.997e+02 3.394e+02 5.567e+02, threshold=5.993e+02, percent-clipped=0.0 2023-06-24 11:08:29,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1086978.0, ans=0.125 2023-06-24 11:08:54,850 INFO [train.py:996] (2/4) Epoch 6, batch 28700, loss[loss=0.2491, simple_loss=0.3281, pruned_loss=0.08508, over 21897.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3034, pruned_loss=0.08119, over 4265932.93 frames. ], batch size: 118, lr: 4.91e-03, grad_scale: 16.0 2023-06-24 11:09:11,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1087098.0, ans=15.0 2023-06-24 11:09:23,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1087098.0, ans=0.0 2023-06-24 11:09:34,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1087158.0, ans=0.0 2023-06-24 11:10:08,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1087218.0, ans=0.125 2023-06-24 11:10:42,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1087278.0, ans=0.2 2023-06-24 11:10:44,961 INFO [train.py:996] (2/4) Epoch 6, batch 28750, loss[loss=0.2273, simple_loss=0.2803, pruned_loss=0.08714, over 21613.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3021, pruned_loss=0.08072, over 4273816.48 frames. ], batch size: 548, lr: 4.91e-03, grad_scale: 16.0 2023-06-24 11:10:57,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1087338.0, ans=0.2 2023-06-24 11:11:30,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1087458.0, ans=0.125 2023-06-24 11:11:53,476 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.016e+02 2.664e+02 3.026e+02 3.382e+02 4.910e+02, threshold=6.051e+02, percent-clipped=0.0 2023-06-24 11:12:16,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1087578.0, ans=0.04949747468305833 2023-06-24 11:12:33,158 INFO [train.py:996] (2/4) Epoch 6, batch 28800, loss[loss=0.243, simple_loss=0.3144, pruned_loss=0.08582, over 21902.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3061, pruned_loss=0.08085, over 4273968.74 frames. ], batch size: 316, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:13:39,238 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.80 vs. 
limit=15.0 2023-06-24 11:14:03,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1087878.0, ans=0.09899494936611666 2023-06-24 11:14:05,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1087878.0, ans=0.0 2023-06-24 11:14:11,770 INFO [train.py:996] (2/4) Epoch 6, batch 28850, loss[loss=0.2625, simple_loss=0.3252, pruned_loss=0.09992, over 21817.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3089, pruned_loss=0.08231, over 4274975.72 frames. ], batch size: 441, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:14:27,022 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.61 vs. limit=22.5 2023-06-24 11:14:44,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1087998.0, ans=0.0 2023-06-24 11:14:47,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1087998.0, ans=0.0 2023-06-24 11:14:49,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1087998.0, ans=0.125 2023-06-24 11:14:58,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1088058.0, ans=0.125 2023-06-24 11:15:25,934 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.795e+02 3.097e+02 3.558e+02 6.026e+02, threshold=6.195e+02, percent-clipped=0.0 2023-06-24 11:16:01,680 INFO [train.py:996] (2/4) Epoch 6, batch 28900, loss[loss=0.2305, simple_loss=0.3042, pruned_loss=0.0784, over 21613.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3117, pruned_loss=0.08456, over 4281635.36 frames. ], batch size: 263, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:16:09,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1088238.0, ans=0.0 2023-06-24 11:16:19,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1088238.0, ans=0.0 2023-06-24 11:16:32,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1088298.0, ans=0.035 2023-06-24 11:18:05,900 INFO [train.py:996] (2/4) Epoch 6, batch 28950, loss[loss=0.1852, simple_loss=0.2478, pruned_loss=0.06123, over 21071.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.311, pruned_loss=0.08396, over 4275150.89 frames. 
], batch size: 143, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:18:46,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1088658.0, ans=0.125 2023-06-24 11:19:09,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1088718.0, ans=0.125 2023-06-24 11:19:11,641 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:19:18,004 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 3.066e+02 3.487e+02 4.356e+02 7.485e+02, threshold=6.974e+02, percent-clipped=4.0 2023-06-24 11:19:18,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1088718.0, ans=0.1 2023-06-24 11:19:42,520 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.62 vs. limit=15.0 2023-06-24 11:19:57,034 INFO [train.py:996] (2/4) Epoch 6, batch 29000, loss[loss=0.3197, simple_loss=0.3708, pruned_loss=0.1343, over 21352.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.315, pruned_loss=0.08384, over 4277118.85 frames. ], batch size: 507, lr: 4.90e-03, grad_scale: 16.0 2023-06-24 11:20:02,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1088838.0, ans=0.5 2023-06-24 11:20:37,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1088898.0, ans=0.0 2023-06-24 11:21:23,581 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-06-24 11:21:38,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1089138.0, ans=0.1 2023-06-24 11:21:39,615 INFO [train.py:996] (2/4) Epoch 6, batch 29050, loss[loss=0.2335, simple_loss=0.3026, pruned_loss=0.08224, over 22006.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3138, pruned_loss=0.08316, over 4283151.85 frames. ], batch size: 416, lr: 4.90e-03, grad_scale: 16.0 2023-06-24 11:22:37,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1089258.0, ans=0.0 2023-06-24 11:22:59,675 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.602e+02 2.960e+02 3.468e+02 4.732e+02, threshold=5.920e+02, percent-clipped=0.0 2023-06-24 11:23:26,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1089438.0, ans=0.0 2023-06-24 11:23:27,549 INFO [train.py:996] (2/4) Epoch 6, batch 29100, loss[loss=0.2064, simple_loss=0.2689, pruned_loss=0.07198, over 21561.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3052, pruned_loss=0.08065, over 4287578.89 frames. ], batch size: 391, lr: 4.90e-03, grad_scale: 16.0 2023-06-24 11:23:45,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1089438.0, ans=0.2 2023-06-24 11:23:55,199 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.36 vs. 
limit=6.0 2023-06-24 11:24:56,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1089678.0, ans=0.125 2023-06-24 11:25:10,510 INFO [train.py:996] (2/4) Epoch 6, batch 29150, loss[loss=0.1811, simple_loss=0.2342, pruned_loss=0.064, over 20678.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3026, pruned_loss=0.07892, over 4286765.94 frames. ], batch size: 608, lr: 4.90e-03, grad_scale: 16.0 2023-06-24 11:25:16,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1089738.0, ans=0.2 2023-06-24 11:25:52,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1089798.0, ans=0.0 2023-06-24 11:26:12,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1089918.0, ans=0.0 2023-06-24 11:26:30,817 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.501e+02 2.832e+02 3.252e+02 5.475e+02, threshold=5.663e+02, percent-clipped=0.0 2023-06-24 11:26:58,350 INFO [train.py:996] (2/4) Epoch 6, batch 29200, loss[loss=0.2088, simple_loss=0.2668, pruned_loss=0.07541, over 21580.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2985, pruned_loss=0.0776, over 4288653.38 frames. ], batch size: 298, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:27:02,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1090038.0, ans=0.0 2023-06-24 11:27:05,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1090038.0, ans=0.0 2023-06-24 11:28:12,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1090218.0, ans=0.125 2023-06-24 11:28:37,508 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.18 vs. limit=6.0 2023-06-24 11:28:44,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1090278.0, ans=0.125 2023-06-24 11:28:47,250 INFO [train.py:996] (2/4) Epoch 6, batch 29250, loss[loss=0.2672, simple_loss=0.3533, pruned_loss=0.09055, over 21577.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2975, pruned_loss=0.07524, over 4278067.21 frames. ], batch size: 441, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:28:55,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1090338.0, ans=0.0 2023-06-24 11:29:09,720 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.07 vs. limit=10.0 2023-06-24 11:29:26,337 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.90 vs. limit=15.0 2023-06-24 11:30:08,098 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.479e+02 2.949e+02 4.059e+02 6.998e+02, threshold=5.898e+02, percent-clipped=9.0 2023-06-24 11:30:40,981 INFO [train.py:996] (2/4) Epoch 6, batch 29300, loss[loss=0.2008, simple_loss=0.2648, pruned_loss=0.06843, over 21148.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2986, pruned_loss=0.07416, over 4276802.15 frames. 
], batch size: 143, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:31:59,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1090818.0, ans=0.0 2023-06-24 11:32:31,128 INFO [train.py:996] (2/4) Epoch 6, batch 29350, loss[loss=0.2074, simple_loss=0.2763, pruned_loss=0.06922, over 21786.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2954, pruned_loss=0.07377, over 4276012.85 frames. ], batch size: 372, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:32:43,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1090938.0, ans=0.0 2023-06-24 11:32:54,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1090998.0, ans=0.125 2023-06-24 11:32:54,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1090998.0, ans=0.0 2023-06-24 11:33:04,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1090998.0, ans=0.125 2023-06-24 11:33:43,341 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.001e+02 2.584e+02 3.038e+02 3.610e+02 5.891e+02, threshold=6.076e+02, percent-clipped=0.0 2023-06-24 11:34:22,825 INFO [train.py:996] (2/4) Epoch 6, batch 29400, loss[loss=0.1878, simple_loss=0.2579, pruned_loss=0.0588, over 21766.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2944, pruned_loss=0.07124, over 4275773.93 frames. ], batch size: 282, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:34:28,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1091238.0, ans=0.125 2023-06-24 11:34:55,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1091298.0, ans=10.0 2023-06-24 11:35:31,384 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.56 vs. limit=15.0 2023-06-24 11:35:43,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1091418.0, ans=0.0 2023-06-24 11:36:12,538 INFO [train.py:996] (2/4) Epoch 6, batch 29450, loss[loss=0.2707, simple_loss=0.3429, pruned_loss=0.09925, over 21618.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2926, pruned_loss=0.07067, over 4268399.76 frames. ], batch size: 389, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:36:44,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1091598.0, ans=0.0 2023-06-24 11:37:17,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=1091718.0, ans=0.2 2023-06-24 11:37:26,463 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.944e+02 2.589e+02 2.908e+02 3.358e+02 5.330e+02, threshold=5.817e+02, percent-clipped=0.0 2023-06-24 11:38:00,174 INFO [train.py:996] (2/4) Epoch 6, batch 29500, loss[loss=0.2321, simple_loss=0.2995, pruned_loss=0.08234, over 21855.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2971, pruned_loss=0.07401, over 4277122.31 frames. 
], batch size: 298, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:38:09,907 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:39:00,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1091958.0, ans=0.125 2023-06-24 11:39:23,483 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=22.5 2023-06-24 11:39:34,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1092078.0, ans=0.125 2023-06-24 11:39:36,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1092078.0, ans=0.0 2023-06-24 11:39:49,848 INFO [train.py:996] (2/4) Epoch 6, batch 29550, loss[loss=0.2098, simple_loss=0.2765, pruned_loss=0.07149, over 21763.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2976, pruned_loss=0.07607, over 4279873.75 frames. ], batch size: 247, lr: 4.89e-03, grad_scale: 32.0 2023-06-24 11:40:30,087 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2023-06-24 11:40:48,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1092258.0, ans=0.125 2023-06-24 11:41:05,559 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 2.863e+02 3.307e+02 3.931e+02 5.796e+02, threshold=6.614e+02, percent-clipped=0.0 2023-06-24 11:41:18,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1092318.0, ans=0.125 2023-06-24 11:41:27,826 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:41:39,641 INFO [train.py:996] (2/4) Epoch 6, batch 29600, loss[loss=0.3074, simple_loss=0.392, pruned_loss=0.1113, over 21289.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3047, pruned_loss=0.07878, over 4282679.72 frames. ], batch size: 548, lr: 4.89e-03, grad_scale: 32.0 2023-06-24 11:42:13,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1092498.0, ans=0.125 2023-06-24 11:43:26,417 INFO [train.py:996] (2/4) Epoch 6, batch 29650, loss[loss=0.2476, simple_loss=0.3116, pruned_loss=0.09177, over 21578.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3028, pruned_loss=0.07557, over 4282397.38 frames. 
], batch size: 471, lr: 4.89e-03, grad_scale: 32.0 2023-06-24 11:44:14,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1092858.0, ans=0.05 2023-06-24 11:44:33,623 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:44:47,011 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 2.546e+02 3.028e+02 3.755e+02 5.764e+02, threshold=6.055e+02, percent-clipped=0.0 2023-06-24 11:44:58,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1092978.0, ans=0.95 2023-06-24 11:45:08,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1092978.0, ans=0.0 2023-06-24 11:45:14,616 INFO [train.py:996] (2/4) Epoch 6, batch 29700, loss[loss=0.2425, simple_loss=0.3202, pruned_loss=0.08242, over 21757.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.302, pruned_loss=0.0753, over 4284506.62 frames. ], batch size: 112, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 11:45:27,761 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=22.5 2023-06-24 11:45:53,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1093098.0, ans=0.125 2023-06-24 11:46:39,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1093218.0, ans=0.2 2023-06-24 11:47:02,449 INFO [train.py:996] (2/4) Epoch 6, batch 29750, loss[loss=0.2242, simple_loss=0.3137, pruned_loss=0.0674, over 21034.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3069, pruned_loss=0.07508, over 4283860.41 frames. ], batch size: 607, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 11:47:15,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1093338.0, ans=0.1 2023-06-24 11:47:28,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1093398.0, ans=0.125 2023-06-24 11:47:28,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1093398.0, ans=0.125 2023-06-24 11:47:33,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1093398.0, ans=0.0 2023-06-24 11:48:02,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1093458.0, ans=0.0 2023-06-24 11:48:09,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1093518.0, ans=0.125 2023-06-24 11:48:23,015 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.842e+02 2.425e+02 2.693e+02 3.074e+02 5.352e+02, threshold=5.385e+02, percent-clipped=0.0 2023-06-24 11:48:54,218 INFO [train.py:996] (2/4) Epoch 6, batch 29800, loss[loss=0.224, simple_loss=0.2954, pruned_loss=0.07626, over 21879.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3078, pruned_loss=0.07549, over 4288949.17 frames. 
], batch size: 332, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 11:49:10,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1093638.0, ans=0.125 2023-06-24 11:50:34,857 INFO [train.py:996] (2/4) Epoch 6, batch 29850, loss[loss=0.2293, simple_loss=0.3422, pruned_loss=0.05818, over 19750.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3039, pruned_loss=0.0735, over 4282280.22 frames. ], batch size: 703, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 11:50:57,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1093938.0, ans=0.0 2023-06-24 11:51:05,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1093998.0, ans=0.07 2023-06-24 11:51:30,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1094058.0, ans=0.125 2023-06-24 11:51:55,461 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.491e+02 2.734e+02 3.399e+02 8.130e+02, threshold=5.469e+02, percent-clipped=4.0 2023-06-24 11:51:56,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1094118.0, ans=0.0 2023-06-24 11:52:26,410 INFO [train.py:996] (2/4) Epoch 6, batch 29900, loss[loss=0.2365, simple_loss=0.2945, pruned_loss=0.08929, over 21041.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3021, pruned_loss=0.07474, over 4285608.97 frames. ], batch size: 608, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 11:52:35,140 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.45 vs. limit=22.5 2023-06-24 11:52:52,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1094298.0, ans=0.05 2023-06-24 11:52:57,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1094298.0, ans=0.1 2023-06-24 11:52:57,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1094298.0, ans=0.125 2023-06-24 11:53:38,142 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.91 vs. limit=5.0 2023-06-24 11:53:46,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1094418.0, ans=0.1 2023-06-24 11:54:12,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1094478.0, ans=0.125 2023-06-24 11:54:23,102 INFO [train.py:996] (2/4) Epoch 6, batch 29950, loss[loss=0.2642, simple_loss=0.3248, pruned_loss=0.1018, over 21376.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3065, pruned_loss=0.07904, over 4280289.45 frames. 
], batch size: 549, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 11:54:32,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1094538.0, ans=0.95 2023-06-24 11:55:35,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1094718.0, ans=0.0 2023-06-24 11:55:41,531 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.234e+02 2.840e+02 3.123e+02 3.616e+02 5.024e+02, threshold=6.246e+02, percent-clipped=0.0 2023-06-24 11:56:03,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1094778.0, ans=0.04949747468305833 2023-06-24 11:56:13,764 INFO [train.py:996] (2/4) Epoch 6, batch 30000, loss[loss=0.227, simple_loss=0.3257, pruned_loss=0.06414, over 21812.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3089, pruned_loss=0.07993, over 4283015.24 frames. ], batch size: 371, lr: 4.89e-03, grad_scale: 32.0 2023-06-24 11:56:13,765 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 11:56:34,170 INFO [train.py:1028] (2/4) Epoch 6, validation: loss=0.2459, simple_loss=0.3437, pruned_loss=0.07409, over 1796401.00 frames. 2023-06-24 11:56:34,171 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-24 11:56:37,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1094838.0, ans=0.125 2023-06-24 11:56:37,831 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=22.5 2023-06-24 11:57:20,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1094958.0, ans=0.125 2023-06-24 11:58:36,160 INFO [train.py:996] (2/4) Epoch 6, batch 30050, loss[loss=0.2381, simple_loss=0.3185, pruned_loss=0.07884, over 21424.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3116, pruned_loss=0.07628, over 4281811.39 frames. ], batch size: 194, lr: 4.89e-03, grad_scale: 32.0 2023-06-24 11:59:13,293 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.05 vs. limit=22.5 2023-06-24 11:59:28,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1095258.0, ans=0.2 2023-06-24 11:59:35,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1095258.0, ans=0.0 2023-06-24 11:59:55,141 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.808e+02 2.460e+02 2.888e+02 3.811e+02 6.345e+02, threshold=5.776e+02, percent-clipped=1.0 2023-06-24 12:00:24,970 INFO [train.py:996] (2/4) Epoch 6, batch 30100, loss[loss=0.2162, simple_loss=0.2734, pruned_loss=0.07947, over 21175.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3085, pruned_loss=0.07654, over 4269813.20 frames. ], batch size: 159, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 12:00:25,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1095438.0, ans=0.0 2023-06-24 12:02:15,874 INFO [train.py:996] (2/4) Epoch 6, batch 30150, loss[loss=0.2954, simple_loss=0.349, pruned_loss=0.1209, over 21433.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3052, pruned_loss=0.07782, over 4264213.43 frames. 
], batch size: 471, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 12:02:40,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1095738.0, ans=0.0 2023-06-24 12:03:11,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1095858.0, ans=0.1 2023-06-24 12:03:25,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1095858.0, ans=0.0 2023-06-24 12:03:41,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1095918.0, ans=0.1 2023-06-24 12:03:44,046 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.176e+02 2.662e+02 2.970e+02 3.572e+02 6.402e+02, threshold=5.941e+02, percent-clipped=1.0 2023-06-24 12:04:19,449 INFO [train.py:996] (2/4) Epoch 6, batch 30200, loss[loss=0.3043, simple_loss=0.3802, pruned_loss=0.1142, over 21427.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3076, pruned_loss=0.07724, over 4268173.38 frames. ], batch size: 507, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 12:06:05,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1096278.0, ans=0.0 2023-06-24 12:06:10,624 INFO [train.py:996] (2/4) Epoch 6, batch 30250, loss[loss=0.2813, simple_loss=0.3816, pruned_loss=0.09049, over 21784.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3161, pruned_loss=0.07912, over 4268280.67 frames. ], batch size: 332, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 12:06:12,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1096338.0, ans=0.125 2023-06-24 12:06:30,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1096338.0, ans=0.1 2023-06-24 12:06:32,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1096398.0, ans=0.125 2023-06-24 12:07:27,929 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 2.716e+02 3.093e+02 3.619e+02 5.439e+02, threshold=6.186e+02, percent-clipped=0.0 2023-06-24 12:07:51,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1096578.0, ans=0.125 2023-06-24 12:07:57,896 INFO [train.py:996] (2/4) Epoch 6, batch 30300, loss[loss=0.2001, simple_loss=0.2596, pruned_loss=0.07027, over 21234.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3137, pruned_loss=0.07907, over 4273839.82 frames. 
], batch size: 549, lr: 4.88e-03, grad_scale: 16.0 2023-06-24 12:08:01,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1096638.0, ans=0.1 2023-06-24 12:08:08,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1096638.0, ans=0.125 2023-06-24 12:08:16,069 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:08:21,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1096698.0, ans=0.125 2023-06-24 12:08:22,015 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.77 vs. limit=15.0 2023-06-24 12:08:29,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1096698.0, ans=15.0 2023-06-24 12:09:54,016 INFO [train.py:996] (2/4) Epoch 6, batch 30350, loss[loss=0.2288, simple_loss=0.2823, pruned_loss=0.08762, over 20027.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3151, pruned_loss=0.08064, over 4270694.92 frames. ], batch size: 702, lr: 4.88e-03, grad_scale: 16.0 2023-06-24 12:09:56,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1096938.0, ans=0.1 2023-06-24 12:10:25,912 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=19.21 vs. limit=22.5 2023-06-24 12:10:26,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1096998.0, ans=0.1 2023-06-24 12:10:49,712 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.97 vs. limit=22.5 2023-06-24 12:10:56,364 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.694e+02 3.043e+02 3.524e+02 5.331e+02, threshold=6.085e+02, percent-clipped=0.0 2023-06-24 12:11:17,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1097178.0, ans=0.0 2023-06-24 12:11:27,900 INFO [train.py:996] (2/4) Epoch 6, batch 30400, loss[loss=0.2271, simple_loss=0.2731, pruned_loss=0.09059, over 20199.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3068, pruned_loss=0.0785, over 4247358.63 frames. ], batch size: 703, lr: 4.88e-03, grad_scale: 32.0 2023-06-24 12:11:34,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1097238.0, ans=0.2 2023-06-24 12:11:36,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1097238.0, ans=0.125 2023-06-24 12:11:38,445 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.07 vs. 
limit=15.0 2023-06-24 12:11:41,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1097238.0, ans=0.2 2023-06-24 12:11:42,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1097298.0, ans=0.125 2023-06-24 12:12:05,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1097358.0, ans=0.2 2023-06-24 12:12:08,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1097358.0, ans=0.0 2023-06-24 12:12:17,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1097418.0, ans=0.125 2023-06-24 12:12:57,189 INFO [train.py:996] (2/4) Epoch 6, batch 30450, loss[loss=0.2728, simple_loss=0.3927, pruned_loss=0.07643, over 19770.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3077, pruned_loss=0.07845, over 4190999.28 frames. ], batch size: 702, lr: 4.88e-03, grad_scale: 32.0 2023-06-24 12:13:02,327 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:13:42,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1097658.0, ans=0.125 2023-06-24 12:13:56,609 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 4.419e+02 5.663e+02 8.899e+02 2.204e+03, threshold=1.133e+03, percent-clipped=46.0 2023-06-24 12:14:02,709 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-06-24 12:16:21,120 INFO [train.py:996] (2/4) Epoch 7, batch 0, loss[loss=0.2355, simple_loss=0.3109, pruned_loss=0.08, over 21852.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3109, pruned_loss=0.08, over 21852.00 frames. ], batch size: 107, lr: 4.48e-03, grad_scale: 32.0 2023-06-24 12:16:21,121 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 12:16:38,602 INFO [train.py:1028] (2/4) Epoch 7, validation: loss=0.2421, simple_loss=0.346, pruned_loss=0.0691, over 1796401.00 frames. 2023-06-24 12:16:38,603 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-24 12:18:03,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1097982.0, ans=0.0 2023-06-24 12:18:18,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1098042.0, ans=0.125 2023-06-24 12:18:25,399 INFO [train.py:996] (2/4) Epoch 7, batch 50, loss[loss=0.2775, simple_loss=0.388, pruned_loss=0.08348, over 21724.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3112, pruned_loss=0.07618, over 960586.14 frames. 
], batch size: 247, lr: 4.48e-03, grad_scale: 32.0 2023-06-24 12:19:13,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1098222.0, ans=0.125 2023-06-24 12:19:30,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1098222.0, ans=0.125 2023-06-24 12:20:01,487 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 2.689e+02 3.085e+02 3.734e+02 9.044e+02, threshold=6.169e+02, percent-clipped=0.0 2023-06-24 12:20:13,728 INFO [train.py:996] (2/4) Epoch 7, batch 100, loss[loss=0.2593, simple_loss=0.3583, pruned_loss=0.08011, over 21390.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3275, pruned_loss=0.07961, over 1683309.62 frames. ], batch size: 194, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:20:39,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1098462.0, ans=0.0 2023-06-24 12:20:40,267 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.16 vs. limit=15.0 2023-06-24 12:22:00,469 INFO [train.py:996] (2/4) Epoch 7, batch 150, loss[loss=0.2021, simple_loss=0.2839, pruned_loss=0.06021, over 21217.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3264, pruned_loss=0.07898, over 2248453.96 frames. ], batch size: 159, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:22:23,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1098762.0, ans=0.2 2023-06-24 12:23:03,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1098822.0, ans=0.125 2023-06-24 12:23:36,483 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.604e+02 2.896e+02 3.363e+02 6.379e+02, threshold=5.792e+02, percent-clipped=1.0 2023-06-24 12:23:42,769 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=12.0 2023-06-24 12:23:45,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1098942.0, ans=0.125 2023-06-24 12:23:47,927 INFO [train.py:996] (2/4) Epoch 7, batch 200, loss[loss=0.2673, simple_loss=0.3512, pruned_loss=0.09175, over 21728.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3239, pruned_loss=0.07835, over 2688711.23 frames. ], batch size: 351, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:24:42,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1099122.0, ans=0.0 2023-06-24 12:24:57,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1099182.0, ans=0.125 2023-06-24 12:24:57,668 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=15.0 2023-06-24 12:25:30,594 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.48 vs. limit=10.0 2023-06-24 12:25:36,652 INFO [train.py:996] (2/4) Epoch 7, batch 250, loss[loss=0.2119, simple_loss=0.2785, pruned_loss=0.07263, over 21779.00 frames. 
], tot_loss[loss=0.2357, simple_loss=0.3195, pruned_loss=0.076, over 3041440.76 frames. ], batch size: 124, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:25:44,382 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=11.01 vs. limit=12.0 2023-06-24 12:25:47,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1099302.0, ans=0.1 2023-06-24 12:26:09,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1099362.0, ans=0.125 2023-06-24 12:26:50,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1099482.0, ans=0.05 2023-06-24 12:26:56,017 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-24 12:27:14,904 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 2.504e+02 2.848e+02 3.185e+02 4.478e+02, threshold=5.696e+02, percent-clipped=0.0 2023-06-24 12:27:15,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1099542.0, ans=0.125 2023-06-24 12:27:27,349 INFO [train.py:996] (2/4) Epoch 7, batch 300, loss[loss=0.2035, simple_loss=0.274, pruned_loss=0.06649, over 21329.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3124, pruned_loss=0.07538, over 3304776.92 frames. ], batch size: 176, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:28:32,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1099722.0, ans=0.1 2023-06-24 12:29:18,776 INFO [train.py:996] (2/4) Epoch 7, batch 350, loss[loss=0.1848, simple_loss=0.2509, pruned_loss=0.05931, over 21557.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3056, pruned_loss=0.07513, over 3525738.83 frames. ], batch size: 247, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:30:54,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1100142.0, ans=0.1 2023-06-24 12:30:57,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1100142.0, ans=0.125 2023-06-24 12:30:58,783 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.904e+02 2.718e+02 3.112e+02 3.692e+02 6.265e+02, threshold=6.224e+02, percent-clipped=2.0 2023-06-24 12:31:01,979 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.13 vs. limit=15.0 2023-06-24 12:31:11,304 INFO [train.py:996] (2/4) Epoch 7, batch 400, loss[loss=0.1939, simple_loss=0.2644, pruned_loss=0.06165, over 21586.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2998, pruned_loss=0.07449, over 3681381.96 frames. ], batch size: 332, lr: 4.48e-03, grad_scale: 32.0 2023-06-24 12:31:13,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1100202.0, ans=0.0 2023-06-24 12:32:14,522 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.82 vs. 
limit=15.0 2023-06-24 12:32:17,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1100322.0, ans=0.125 2023-06-24 12:32:53,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1100442.0, ans=0.125 2023-06-24 12:33:02,124 INFO [train.py:996] (2/4) Epoch 7, batch 450, loss[loss=0.2356, simple_loss=0.3156, pruned_loss=0.07776, over 21893.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2997, pruned_loss=0.07403, over 3810179.23 frames. ], batch size: 316, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:33:10,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1100502.0, ans=0.125 2023-06-24 12:33:15,418 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=15.0 2023-06-24 12:33:22,690 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.94 vs. limit=15.0 2023-06-24 12:33:57,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1100622.0, ans=0.125 2023-06-24 12:34:13,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1100622.0, ans=0.2 2023-06-24 12:34:23,117 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.12 vs. limit=10.0 2023-06-24 12:34:40,557 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.890e+02 2.615e+02 3.361e+02 4.061e+02 5.988e+02, threshold=6.722e+02, percent-clipped=0.0 2023-06-24 12:34:57,183 INFO [train.py:996] (2/4) Epoch 7, batch 500, loss[loss=0.2832, simple_loss=0.3706, pruned_loss=0.09786, over 21850.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2993, pruned_loss=0.07303, over 3903195.49 frames. ], batch size: 371, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:36:25,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1101042.0, ans=0.125 2023-06-24 12:36:46,107 INFO [train.py:996] (2/4) Epoch 7, batch 550, loss[loss=0.1935, simple_loss=0.2781, pruned_loss=0.05445, over 21159.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2967, pruned_loss=0.07196, over 3977635.97 frames. ], batch size: 548, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:37:15,791 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.62 vs. 
limit=15.0 2023-06-24 12:37:24,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1101222.0, ans=0.0 2023-06-24 12:37:45,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1101222.0, ans=0.0 2023-06-24 12:37:59,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1101282.0, ans=0.025 2023-06-24 12:38:14,377 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 2.641e+02 3.136e+02 3.627e+02 5.437e+02, threshold=6.272e+02, percent-clipped=0.0 2023-06-24 12:38:25,771 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:38:27,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1101402.0, ans=0.1 2023-06-24 12:38:28,519 INFO [train.py:996] (2/4) Epoch 7, batch 600, loss[loss=0.2374, simple_loss=0.3319, pruned_loss=0.07151, over 21649.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3022, pruned_loss=0.07189, over 4048362.96 frames. ], batch size: 263, lr: 4.48e-03, grad_scale: 8.0 2023-06-24 12:38:49,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1101402.0, ans=0.125 2023-06-24 12:38:50,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1101462.0, ans=0.125 2023-06-24 12:39:28,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1101522.0, ans=0.0 2023-06-24 12:39:33,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1101522.0, ans=0.5 2023-06-24 12:39:49,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1101582.0, ans=0.0 2023-06-24 12:40:03,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1101642.0, ans=0.1 2023-06-24 12:40:16,851 INFO [train.py:996] (2/4) Epoch 7, batch 650, loss[loss=0.236, simple_loss=0.2979, pruned_loss=0.08706, over 21859.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3045, pruned_loss=0.07251, over 4094622.59 frames. ], batch size: 414, lr: 4.48e-03, grad_scale: 8.0 2023-06-24 12:40:23,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.whiten.whitening_limit, batch_count=1101702.0, ans=12.0 2023-06-24 12:40:24,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1101702.0, ans=0.2 2023-06-24 12:40:27,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1101702.0, ans=0.0 2023-06-24 12:41:18,493 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.60 vs. 
limit=15.0 2023-06-24 12:41:51,783 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 2.716e+02 3.087e+02 3.645e+02 5.920e+02, threshold=6.175e+02, percent-clipped=0.0 2023-06-24 12:42:05,951 INFO [train.py:996] (2/4) Epoch 7, batch 700, loss[loss=0.2374, simple_loss=0.3071, pruned_loss=0.08384, over 21917.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3063, pruned_loss=0.07422, over 4144258.41 frames. ], batch size: 351, lr: 4.48e-03, grad_scale: 8.0 2023-06-24 12:42:33,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1102062.0, ans=0.0 2023-06-24 12:42:40,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1102062.0, ans=0.0 2023-06-24 12:43:50,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1102242.0, ans=0.125 2023-06-24 12:43:59,381 INFO [train.py:996] (2/4) Epoch 7, batch 750, loss[loss=0.2243, simple_loss=0.2872, pruned_loss=0.08068, over 15024.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3073, pruned_loss=0.07556, over 4168004.92 frames. ], batch size: 60, lr: 4.48e-03, grad_scale: 8.0 2023-06-24 12:44:05,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1102302.0, ans=0.125 2023-06-24 12:44:30,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1102362.0, ans=0.125 2023-06-24 12:45:01,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1102482.0, ans=0.125 2023-06-24 12:45:06,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1102482.0, ans=0.0 2023-06-24 12:45:28,800 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.942e+02 3.385e+02 4.235e+02 7.679e+02, threshold=6.771e+02, percent-clipped=3.0 2023-06-24 12:45:38,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1102542.0, ans=0.2 2023-06-24 12:45:43,008 INFO [train.py:996] (2/4) Epoch 7, batch 800, loss[loss=0.2633, simple_loss=0.3516, pruned_loss=0.08751, over 21716.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3041, pruned_loss=0.07508, over 4194130.90 frames. ], batch size: 298, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:46:12,371 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:46:42,920 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=15.0 2023-06-24 12:46:59,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1102782.0, ans=0.1 2023-06-24 12:47:01,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1102782.0, ans=0.0 2023-06-24 12:47:01,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1102782.0, ans=0.2 2023-06-24 12:47:38,974 INFO [train.py:996] (2/4) Epoch 7, batch 850, loss[loss=0.2112, simple_loss=0.28, pruned_loss=0.07124, over 21945.00 frames. 
], tot_loss[loss=0.2258, simple_loss=0.3022, pruned_loss=0.07473, over 4213600.49 frames. ], batch size: 316, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:47:39,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1102902.0, ans=0.125 2023-06-24 12:48:38,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1103022.0, ans=0.1 2023-06-24 12:49:07,526 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.715e+02 3.192e+02 3.563e+02 7.547e+02, threshold=6.383e+02, percent-clipped=1.0 2023-06-24 12:49:27,453 INFO [train.py:996] (2/4) Epoch 7, batch 900, loss[loss=0.2261, simple_loss=0.2809, pruned_loss=0.08559, over 21204.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2971, pruned_loss=0.07385, over 4232382.69 frames. ], batch size: 176, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:49:55,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1103262.0, ans=0.125 2023-06-24 12:50:54,677 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=22.5 2023-06-24 12:51:17,600 INFO [train.py:996] (2/4) Epoch 7, batch 950, loss[loss=0.2953, simple_loss=0.3528, pruned_loss=0.1189, over 21449.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2966, pruned_loss=0.07463, over 4243814.23 frames. ], batch size: 507, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:52:46,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1103742.0, ans=0.125 2023-06-24 12:52:56,908 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.54 vs. limit=10.0 2023-06-24 12:52:59,202 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 2.594e+02 2.897e+02 3.337e+02 7.292e+02, threshold=5.794e+02, percent-clipped=1.0 2023-06-24 12:53:07,731 INFO [train.py:996] (2/4) Epoch 7, batch 1000, loss[loss=0.2092, simple_loss=0.279, pruned_loss=0.0697, over 21755.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2971, pruned_loss=0.07502, over 4256250.15 frames. ], batch size: 247, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:53:18,868 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.41 vs. limit=15.0 2023-06-24 12:53:31,015 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-06-24 12:54:08,577 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.87 vs. 
limit=22.5 2023-06-24 12:54:22,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1103982.0, ans=0.2 2023-06-24 12:54:47,673 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:55:05,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1104102.0, ans=0.125 2023-06-24 12:55:12,147 INFO [train.py:996] (2/4) Epoch 7, batch 1050, loss[loss=0.2521, simple_loss=0.325, pruned_loss=0.08963, over 21822.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2942, pruned_loss=0.07376, over 4262121.85 frames. ], batch size: 247, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:56:24,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1104282.0, ans=0.125 2023-06-24 12:56:32,003 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.19 vs. limit=15.0 2023-06-24 12:56:38,168 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:56:42,630 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.809e+02 3.239e+02 3.685e+02 6.477e+02, threshold=6.478e+02, percent-clipped=3.0 2023-06-24 12:56:56,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1104402.0, ans=0.0 2023-06-24 12:56:57,497 INFO [train.py:996] (2/4) Epoch 7, batch 1100, loss[loss=0.2324, simple_loss=0.2999, pruned_loss=0.08242, over 21415.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2944, pruned_loss=0.07409, over 4263227.49 frames. ], batch size: 194, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:57:20,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1104462.0, ans=0.125 2023-06-24 12:58:11,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1104582.0, ans=0.0 2023-06-24 12:58:38,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1104642.0, ans=0.125 2023-06-24 12:58:48,079 INFO [train.py:996] (2/4) Epoch 7, batch 1150, loss[loss=0.2384, simple_loss=0.3165, pruned_loss=0.08021, over 21299.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2952, pruned_loss=0.07326, over 4265358.12 frames. ], batch size: 131, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:58:55,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1104702.0, ans=0.125 2023-06-24 12:59:06,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1104762.0, ans=0.1 2023-06-24 12:59:11,153 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.38 vs. 
limit=15.0 2023-06-24 12:59:36,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1104822.0, ans=0.125 2023-06-24 12:59:37,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1104822.0, ans=0.0 2023-06-24 13:00:30,288 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.093e+02 2.493e+02 2.841e+02 3.361e+02 6.236e+02, threshold=5.682e+02, percent-clipped=0.0 2023-06-24 13:00:38,723 INFO [train.py:996] (2/4) Epoch 7, batch 1200, loss[loss=0.2746, simple_loss=0.3589, pruned_loss=0.09519, over 21927.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2984, pruned_loss=0.07432, over 4276936.17 frames. ], batch size: 372, lr: 4.47e-03, grad_scale: 32.0 2023-06-24 13:00:46,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1105002.0, ans=0.05 2023-06-24 13:00:53,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1105002.0, ans=0.05 2023-06-24 13:01:06,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1105062.0, ans=0.125 2023-06-24 13:01:14,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1105122.0, ans=0.1 2023-06-24 13:01:27,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1105122.0, ans=0.1 2023-06-24 13:01:40,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1105182.0, ans=0.125 2023-06-24 13:01:43,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1105182.0, ans=0.0 2023-06-24 13:02:06,480 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.56 vs. limit=10.0 2023-06-24 13:02:14,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1105242.0, ans=0.125 2023-06-24 13:02:25,981 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-24 13:02:28,499 INFO [train.py:996] (2/4) Epoch 7, batch 1250, loss[loss=0.2894, simple_loss=0.3448, pruned_loss=0.117, over 21645.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3012, pruned_loss=0.07507, over 4280849.49 frames. ], batch size: 507, lr: 4.47e-03, grad_scale: 32.0 2023-06-24 13:02:46,178 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.14 vs. 
limit=10.0 2023-06-24 13:03:45,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1105482.0, ans=0.125 2023-06-24 13:03:46,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1105482.0, ans=0.2 2023-06-24 13:04:09,147 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.694e+02 3.114e+02 3.849e+02 5.488e+02, threshold=6.227e+02, percent-clipped=0.0 2023-06-24 13:04:18,080 INFO [train.py:996] (2/4) Epoch 7, batch 1300, loss[loss=0.3395, simple_loss=0.4088, pruned_loss=0.1351, over 21524.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3026, pruned_loss=0.07435, over 4275239.40 frames. ], batch size: 507, lr: 4.47e-03, grad_scale: 32.0 2023-06-24 13:04:18,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1105602.0, ans=0.1 2023-06-24 13:04:31,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1105602.0, ans=0.1 2023-06-24 13:04:39,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1105662.0, ans=0.1 2023-06-24 13:04:58,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1105722.0, ans=0.0 2023-06-24 13:06:06,841 INFO [train.py:996] (2/4) Epoch 7, batch 1350, loss[loss=0.2067, simple_loss=0.2826, pruned_loss=0.06538, over 21775.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3023, pruned_loss=0.07408, over 4280061.93 frames. ], batch size: 102, lr: 4.47e-03, grad_scale: 32.0 2023-06-24 13:06:28,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1105962.0, ans=0.1 2023-06-24 13:07:47,686 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-24 13:07:48,034 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.180e+02 2.498e+02 2.809e+02 3.151e+02 4.941e+02, threshold=5.617e+02, percent-clipped=0.0 2023-06-24 13:07:53,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1106142.0, ans=0.0 2023-06-24 13:07:56,339 INFO [train.py:996] (2/4) Epoch 7, batch 1400, loss[loss=0.189, simple_loss=0.2589, pruned_loss=0.05953, over 21207.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3001, pruned_loss=0.074, over 4280940.72 frames. ], batch size: 607, lr: 4.47e-03, grad_scale: 32.0 2023-06-24 13:08:19,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1106262.0, ans=0.125 2023-06-24 13:09:29,387 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=12.0 2023-06-24 13:09:33,039 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.01 vs. 
limit=6.0 2023-06-24 13:09:39,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1106442.0, ans=0.125 2023-06-24 13:09:46,130 INFO [train.py:996] (2/4) Epoch 7, batch 1450, loss[loss=0.2241, simple_loss=0.3066, pruned_loss=0.07076, over 21799.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3011, pruned_loss=0.0744, over 4286135.63 frames. ], batch size: 124, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 13:10:58,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1106682.0, ans=0.125 2023-06-24 13:11:13,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1106682.0, ans=0.125 2023-06-24 13:11:28,956 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.734e+02 3.228e+02 3.700e+02 6.613e+02, threshold=6.455e+02, percent-clipped=4.0 2023-06-24 13:11:36,330 INFO [train.py:996] (2/4) Epoch 7, batch 1500, loss[loss=0.2021, simple_loss=0.2645, pruned_loss=0.06983, over 21441.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3028, pruned_loss=0.07563, over 4289463.55 frames. ], batch size: 389, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 13:11:43,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1106802.0, ans=0.125 2023-06-24 13:12:38,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1106922.0, ans=0.2 2023-06-24 13:12:53,605 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:12:57,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1106982.0, ans=0.125 2023-06-24 13:13:03,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1106982.0, ans=0.1 2023-06-24 13:13:05,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1107042.0, ans=0.0 2023-06-24 13:13:24,206 INFO [train.py:996] (2/4) Epoch 7, batch 1550, loss[loss=0.2398, simple_loss=0.3148, pruned_loss=0.08237, over 20731.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3019, pruned_loss=0.07528, over 4285832.68 frames. ], batch size: 607, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 13:13:51,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1107162.0, ans=0.2 2023-06-24 13:14:49,427 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:15:03,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1107342.0, ans=0.2 2023-06-24 13:15:06,853 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 2.619e+02 3.008e+02 3.656e+02 5.850e+02, threshold=6.017e+02, percent-clipped=0.0 2023-06-24 13:15:13,476 INFO [train.py:996] (2/4) Epoch 7, batch 1600, loss[loss=0.2925, simple_loss=0.3664, pruned_loss=0.1092, over 21634.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3019, pruned_loss=0.07526, over 4281585.16 frames. 
], batch size: 414, lr: 4.47e-03, grad_scale: 32.0 2023-06-24 13:15:16,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1107402.0, ans=0.0 2023-06-24 13:15:35,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1107462.0, ans=0.125 2023-06-24 13:15:47,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1107462.0, ans=0.0 2023-06-24 13:16:24,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1107522.0, ans=0.125 2023-06-24 13:16:29,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1107582.0, ans=15.0 2023-06-24 13:17:11,099 INFO [train.py:996] (2/4) Epoch 7, batch 1650, loss[loss=0.2113, simple_loss=0.2872, pruned_loss=0.06774, over 21676.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.301, pruned_loss=0.07432, over 4277022.04 frames. ], batch size: 332, lr: 4.46e-03, grad_scale: 32.0 2023-06-24 13:18:40,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1107942.0, ans=0.0 2023-06-24 13:18:55,910 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.937e+02 2.769e+02 3.129e+02 3.705e+02 6.024e+02, threshold=6.259e+02, percent-clipped=1.0 2023-06-24 13:19:02,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1108002.0, ans=0.015 2023-06-24 13:19:03,709 INFO [train.py:996] (2/4) Epoch 7, batch 1700, loss[loss=0.2168, simple_loss=0.2915, pruned_loss=0.07105, over 21617.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3036, pruned_loss=0.07488, over 4285846.44 frames. ], batch size: 263, lr: 4.46e-03, grad_scale: 32.0 2023-06-24 13:19:30,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1108002.0, ans=0.2 2023-06-24 13:21:02,721 INFO [train.py:996] (2/4) Epoch 7, batch 1750, loss[loss=0.1489, simple_loss=0.215, pruned_loss=0.04139, over 21271.00 frames. ], tot_loss[loss=0.225, simple_loss=0.302, pruned_loss=0.074, over 4286607.95 frames. ], batch size: 143, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:21:25,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1108302.0, ans=0.1 2023-06-24 13:21:32,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1108362.0, ans=0.2 2023-06-24 13:22:14,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1108482.0, ans=0.125 2023-06-24 13:22:27,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1108482.0, ans=0.0 2023-06-24 13:22:56,564 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.880e+02 2.765e+02 3.316e+02 4.331e+02 7.357e+02, threshold=6.632e+02, percent-clipped=3.0 2023-06-24 13:22:59,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.whiten.whitening_limit, batch_count=1108542.0, ans=12.0 2023-06-24 13:23:07,044 INFO [train.py:996] (2/4) Epoch 7, batch 1800, loss[loss=0.2528, simple_loss=0.3286, pruned_loss=0.08846, over 21389.00 frames. 
], tot_loss[loss=0.2235, simple_loss=0.3019, pruned_loss=0.07258, over 4281238.38 frames. ], batch size: 549, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:23:07,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1108602.0, ans=0.1 2023-06-24 13:23:14,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1108602.0, ans=0.125 2023-06-24 13:23:37,575 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=15.0 2023-06-24 13:23:47,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1108722.0, ans=0.125 2023-06-24 13:24:07,477 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:24:29,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1108842.0, ans=0.2 2023-06-24 13:24:52,485 INFO [train.py:996] (2/4) Epoch 7, batch 1850, loss[loss=0.1696, simple_loss=0.2518, pruned_loss=0.04369, over 21431.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.3011, pruned_loss=0.07087, over 4281076.15 frames. ], batch size: 211, lr: 4.46e-03, grad_scale: 8.0 2023-06-24 13:25:21,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1108962.0, ans=22.5 2023-06-24 13:25:25,068 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.99 vs. limit=22.5 2023-06-24 13:26:00,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1109082.0, ans=0.0 2023-06-24 13:26:38,963 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.824e+02 2.879e+02 3.449e+02 4.316e+02 7.592e+02, threshold=6.898e+02, percent-clipped=3.0 2023-06-24 13:26:47,942 INFO [train.py:996] (2/4) Epoch 7, batch 1900, loss[loss=0.1922, simple_loss=0.2618, pruned_loss=0.06129, over 21372.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.3012, pruned_loss=0.07126, over 4277907.29 frames. ], batch size: 194, lr: 4.46e-03, grad_scale: 8.0 2023-06-24 13:26:52,362 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.73 vs. limit=15.0 2023-06-24 13:27:02,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1109202.0, ans=0.125 2023-06-24 13:27:12,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1109262.0, ans=0.125 2023-06-24 13:27:16,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1109262.0, ans=0.025 2023-06-24 13:27:43,539 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.13 vs. 
limit=15.0 2023-06-24 13:27:44,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1109382.0, ans=0.125 2023-06-24 13:28:16,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1109442.0, ans=0.125 2023-06-24 13:28:38,159 INFO [train.py:996] (2/4) Epoch 7, batch 1950, loss[loss=0.2158, simple_loss=0.2724, pruned_loss=0.07957, over 21833.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2972, pruned_loss=0.0711, over 4280832.99 frames. ], batch size: 98, lr: 4.46e-03, grad_scale: 8.0 2023-06-24 13:28:48,517 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=15.0 2023-06-24 13:28:54,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1109562.0, ans=0.0 2023-06-24 13:28:55,152 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=15.0 2023-06-24 13:30:26,491 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.057e+02 2.664e+02 3.137e+02 3.840e+02 6.499e+02, threshold=6.275e+02, percent-clipped=0.0 2023-06-24 13:30:29,925 INFO [train.py:996] (2/4) Epoch 7, batch 2000, loss[loss=0.1699, simple_loss=0.246, pruned_loss=0.04688, over 21588.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2931, pruned_loss=0.06923, over 4283008.81 frames. ], batch size: 230, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:30:42,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1109802.0, ans=0.1 2023-06-24 13:32:20,837 INFO [train.py:996] (2/4) Epoch 7, batch 2050, loss[loss=0.168, simple_loss=0.2442, pruned_loss=0.04588, over 21511.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2922, pruned_loss=0.0683, over 4278709.47 frames. ], batch size: 212, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:34:07,177 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.697e+02 3.083e+02 3.787e+02 7.892e+02, threshold=6.165e+02, percent-clipped=1.0 2023-06-24 13:34:07,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1110342.0, ans=0.0 2023-06-24 13:34:10,773 INFO [train.py:996] (2/4) Epoch 7, batch 2100, loss[loss=0.2399, simple_loss=0.3019, pruned_loss=0.08891, over 21330.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2959, pruned_loss=0.06984, over 4279849.57 frames. ], batch size: 471, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:34:48,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1110462.0, ans=0.125 2023-06-24 13:35:57,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1110642.0, ans=0.1 2023-06-24 13:36:02,104 INFO [train.py:996] (2/4) Epoch 7, batch 2150, loss[loss=0.237, simple_loss=0.3066, pruned_loss=0.08371, over 21227.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2986, pruned_loss=0.07193, over 4278941.29 frames. 
], batch size: 159, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:36:20,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1110762.0, ans=0.125 2023-06-24 13:36:35,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1110762.0, ans=0.05 2023-06-24 13:36:50,952 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=15.0 2023-06-24 13:37:30,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1110942.0, ans=10.0 2023-06-24 13:37:32,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1110942.0, ans=0.125 2023-06-24 13:37:49,021 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.806e+02 3.490e+02 4.529e+02 7.299e+02, threshold=6.981e+02, percent-clipped=4.0 2023-06-24 13:37:52,591 INFO [train.py:996] (2/4) Epoch 7, batch 2200, loss[loss=0.1911, simple_loss=0.2627, pruned_loss=0.05977, over 21225.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.3006, pruned_loss=0.07159, over 4271786.57 frames. ], batch size: 159, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:38:17,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1111062.0, ans=0.125 2023-06-24 13:39:01,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1111182.0, ans=0.125 2023-06-24 13:39:08,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1111182.0, ans=15.0 2023-06-24 13:39:40,164 INFO [train.py:996] (2/4) Epoch 7, batch 2250, loss[loss=0.2127, simple_loss=0.28, pruned_loss=0.07273, over 21706.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2964, pruned_loss=0.06989, over 4267583.09 frames. ], batch size: 316, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:40:51,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1111482.0, ans=0.1 2023-06-24 13:41:06,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1111542.0, ans=0.2 2023-06-24 13:41:24,943 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.962e+02 2.731e+02 3.125e+02 3.958e+02 6.138e+02, threshold=6.249e+02, percent-clipped=0.0 2023-06-24 13:41:28,587 INFO [train.py:996] (2/4) Epoch 7, batch 2300, loss[loss=0.1886, simple_loss=0.2572, pruned_loss=0.06003, over 21653.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2935, pruned_loss=0.06949, over 4270487.18 frames. 
], batch size: 333, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:41:43,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1111602.0, ans=0.0 2023-06-24 13:41:49,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1111662.0, ans=0.0 2023-06-24 13:42:25,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1111722.0, ans=0.1 2023-06-24 13:42:27,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1111722.0, ans=0.1 2023-06-24 13:42:47,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1111782.0, ans=0.0 2023-06-24 13:43:17,670 INFO [train.py:996] (2/4) Epoch 7, batch 2350, loss[loss=0.2038, simple_loss=0.2668, pruned_loss=0.07034, over 21724.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2938, pruned_loss=0.07093, over 4261209.03 frames. ], batch size: 351, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:43:31,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1111902.0, ans=0.1 2023-06-24 13:43:48,625 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.52 vs. limit=15.0 2023-06-24 13:44:16,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1112022.0, ans=0.125 2023-06-24 13:44:45,203 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0 2023-06-24 13:44:58,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1112142.0, ans=0.125 2023-06-24 13:45:05,379 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.026e+02 2.752e+02 3.198e+02 3.763e+02 6.793e+02, threshold=6.396e+02, percent-clipped=2.0 2023-06-24 13:45:08,866 INFO [train.py:996] (2/4) Epoch 7, batch 2400, loss[loss=0.2696, simple_loss=0.3377, pruned_loss=0.1008, over 21602.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.297, pruned_loss=0.07342, over 4265264.59 frames. ], batch size: 415, lr: 4.46e-03, grad_scale: 32.0 2023-06-24 13:45:32,281 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-24 13:45:44,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1112262.0, ans=0.0 2023-06-24 13:46:54,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1112442.0, ans=0.2 2023-06-24 13:46:59,005 INFO [train.py:996] (2/4) Epoch 7, batch 2450, loss[loss=0.2044, simple_loss=0.2745, pruned_loss=0.06719, over 21628.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3012, pruned_loss=0.07508, over 4270525.94 frames. 
], batch size: 332, lr: 4.46e-03, grad_scale: 32.0 2023-06-24 13:48:16,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1112682.0, ans=0.125 2023-06-24 13:48:21,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1112682.0, ans=10.0 2023-06-24 13:48:43,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1112742.0, ans=0.125 2023-06-24 13:48:48,359 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 2.774e+02 3.648e+02 4.607e+02 7.858e+02, threshold=7.296e+02, percent-clipped=5.0 2023-06-24 13:48:51,813 INFO [train.py:996] (2/4) Epoch 7, batch 2500, loss[loss=0.2186, simple_loss=0.2717, pruned_loss=0.08274, over 21547.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2975, pruned_loss=0.07392, over 4264005.60 frames. ], batch size: 442, lr: 4.45e-03, grad_scale: 32.0 2023-06-24 13:49:56,966 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.14 vs. limit=15.0 2023-06-24 13:50:00,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1112982.0, ans=0.125 2023-06-24 13:50:32,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1113042.0, ans=0.0 2023-06-24 13:50:35,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1113042.0, ans=0.125 2023-06-24 13:50:42,243 INFO [train.py:996] (2/4) Epoch 7, batch 2550, loss[loss=0.2265, simple_loss=0.296, pruned_loss=0.07856, over 21313.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2975, pruned_loss=0.07389, over 4265409.45 frames. ], batch size: 131, lr: 4.45e-03, grad_scale: 32.0 2023-06-24 13:52:00,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1113282.0, ans=0.2 2023-06-24 13:52:30,408 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 2.860e+02 3.358e+02 4.176e+02 6.278e+02, threshold=6.716e+02, percent-clipped=0.0 2023-06-24 13:52:32,021 INFO [train.py:996] (2/4) Epoch 7, batch 2600, loss[loss=0.2478, simple_loss=0.3142, pruned_loss=0.09072, over 21576.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3, pruned_loss=0.0757, over 4262682.36 frames. ], batch size: 263, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 13:53:48,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1113582.0, ans=0.0 2023-06-24 13:54:23,113 INFO [train.py:996] (2/4) Epoch 7, batch 2650, loss[loss=0.2701, simple_loss=0.3218, pruned_loss=0.1092, over 21695.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2999, pruned_loss=0.07588, over 4267730.88 frames. ], batch size: 508, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 13:54:26,284 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.99 vs. 
limit=22.5 2023-06-24 13:55:31,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1113822.0, ans=0.1 2023-06-24 13:56:12,127 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 2.624e+02 3.107e+02 3.655e+02 6.528e+02, threshold=6.215e+02, percent-clipped=0.0 2023-06-24 13:56:14,283 INFO [train.py:996] (2/4) Epoch 7, batch 2700, loss[loss=0.238, simple_loss=0.3083, pruned_loss=0.08384, over 21871.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2983, pruned_loss=0.0747, over 4275017.09 frames. ], batch size: 124, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 13:56:16,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1114002.0, ans=0.0 2023-06-24 13:56:17,065 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.89 vs. limit=15.0 2023-06-24 13:56:23,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1114002.0, ans=0.1 2023-06-24 13:57:30,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1114182.0, ans=0.125 2023-06-24 13:58:04,707 INFO [train.py:996] (2/4) Epoch 7, batch 2750, loss[loss=0.2769, simple_loss=0.3432, pruned_loss=0.1053, over 21727.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2969, pruned_loss=0.07467, over 4280651.21 frames. ], batch size: 441, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 13:58:25,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1114302.0, ans=0.125 2023-06-24 13:58:31,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1114362.0, ans=0.0 2023-06-24 13:59:26,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1114482.0, ans=0.0 2023-06-24 13:59:40,862 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.95 vs. limit=15.0 2023-06-24 13:59:44,864 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.53 vs. limit=22.5 2023-06-24 13:59:48,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1114542.0, ans=0.125 2023-06-24 13:59:49,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1114542.0, ans=0.1 2023-06-24 14:00:01,268 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.337e+02 2.964e+02 3.229e+02 3.808e+02 6.340e+02, threshold=6.458e+02, percent-clipped=1.0 2023-06-24 14:00:03,066 INFO [train.py:996] (2/4) Epoch 7, batch 2800, loss[loss=0.2266, simple_loss=0.3043, pruned_loss=0.0744, over 21632.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2999, pruned_loss=0.07546, over 4285921.64 frames. ], batch size: 230, lr: 4.45e-03, grad_scale: 32.0 2023-06-24 14:00:11,909 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.43 vs. 
limit=15.0 2023-06-24 14:00:23,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1114602.0, ans=0.5 2023-06-24 14:00:40,567 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.35 vs. limit=22.5 2023-06-24 14:00:43,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1114662.0, ans=0.0 2023-06-24 14:01:08,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1114722.0, ans=0.0 2023-06-24 14:01:26,892 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.94 vs. limit=15.0 2023-06-24 14:01:33,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1114842.0, ans=0.1 2023-06-24 14:01:54,116 INFO [train.py:996] (2/4) Epoch 7, batch 2850, loss[loss=0.2581, simple_loss=0.3298, pruned_loss=0.09318, over 21752.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3036, pruned_loss=0.07766, over 4282227.53 frames. ], batch size: 441, lr: 4.45e-03, grad_scale: 32.0 2023-06-24 14:01:56,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1114902.0, ans=0.125 2023-06-24 14:02:33,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1114962.0, ans=0.125 2023-06-24 14:03:03,426 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.30 vs. limit=15.0 2023-06-24 14:03:10,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1115082.0, ans=0.125 2023-06-24 14:03:20,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1115142.0, ans=0.0 2023-06-24 14:03:42,916 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.437e+02 2.854e+02 3.316e+02 3.985e+02 8.556e+02, threshold=6.632e+02, percent-clipped=4.0 2023-06-24 14:03:42,947 INFO [train.py:996] (2/4) Epoch 7, batch 2900, loss[loss=0.2429, simple_loss=0.3071, pruned_loss=0.08939, over 21736.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3026, pruned_loss=0.07812, over 4285772.55 frames. 
], batch size: 389, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:03:43,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1115202.0, ans=0.125 2023-06-24 14:03:46,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1115202.0, ans=0.04949747468305833 2023-06-24 14:04:17,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1115262.0, ans=0.0 2023-06-24 14:04:42,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1115322.0, ans=0.125 2023-06-24 14:04:49,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1115322.0, ans=0.07 2023-06-24 14:05:25,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1115442.0, ans=0.1 2023-06-24 14:05:33,572 INFO [train.py:996] (2/4) Epoch 7, batch 2950, loss[loss=0.2303, simple_loss=0.3217, pruned_loss=0.06946, over 21797.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3027, pruned_loss=0.07801, over 4285405.36 frames. ], batch size: 247, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:05:37,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1115502.0, ans=0.0 2023-06-24 14:05:49,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1115502.0, ans=0.125 2023-06-24 14:05:55,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1115562.0, ans=0.2 2023-06-24 14:05:55,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1115562.0, ans=0.125 2023-06-24 14:05:59,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1115562.0, ans=0.125 2023-06-24 14:07:24,640 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.073e+02 2.857e+02 3.209e+02 3.929e+02 8.381e+02, threshold=6.419e+02, percent-clipped=2.0 2023-06-24 14:07:24,675 INFO [train.py:996] (2/4) Epoch 7, batch 3000, loss[loss=0.3171, simple_loss=0.3716, pruned_loss=0.1314, over 21385.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3065, pruned_loss=0.07867, over 4288849.74 frames. ], batch size: 508, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:07:24,676 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 14:07:46,566 INFO [train.py:1028] (2/4) Epoch 7, validation: loss=0.2481, simple_loss=0.3407, pruned_loss=0.0778, over 1796401.00 frames. 
2023-06-24 14:07:46,567 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-24 14:07:47,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1115802.0, ans=0.1 2023-06-24 14:08:03,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1115802.0, ans=0.1 2023-06-24 14:08:41,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1115922.0, ans=0.0 2023-06-24 14:09:07,211 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.07 vs. limit=10.0 2023-06-24 14:09:30,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1116042.0, ans=0.125 2023-06-24 14:09:32,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1116042.0, ans=0.125 2023-06-24 14:09:37,555 INFO [train.py:996] (2/4) Epoch 7, batch 3050, loss[loss=0.169, simple_loss=0.255, pruned_loss=0.04147, over 21451.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3068, pruned_loss=0.07647, over 4282278.54 frames. ], batch size: 194, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:09:59,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1116162.0, ans=0.0 2023-06-24 14:10:15,902 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=15.0 2023-06-24 14:10:31,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1116222.0, ans=0.2 2023-06-24 14:10:39,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1116222.0, ans=0.125 2023-06-24 14:11:33,742 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.863e+02 2.533e+02 2.915e+02 3.819e+02 6.639e+02, threshold=5.830e+02, percent-clipped=1.0 2023-06-24 14:11:33,774 INFO [train.py:996] (2/4) Epoch 7, batch 3100, loss[loss=0.2219, simple_loss=0.3167, pruned_loss=0.06359, over 21665.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.305, pruned_loss=0.07463, over 4283018.70 frames. ], batch size: 263, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:12:00,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1116462.0, ans=0.1 2023-06-24 14:12:35,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1116522.0, ans=0.0 2023-06-24 14:13:25,683 INFO [train.py:996] (2/4) Epoch 7, batch 3150, loss[loss=0.2584, simple_loss=0.3435, pruned_loss=0.08667, over 21787.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3054, pruned_loss=0.07497, over 4273198.68 frames. 
], batch size: 124, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:14:07,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1116762.0, ans=0.0 2023-06-24 14:14:46,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1116882.0, ans=0.5 2023-06-24 14:14:54,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1116882.0, ans=0.1 2023-06-24 14:15:07,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1116942.0, ans=0.1 2023-06-24 14:15:17,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1116942.0, ans=0.2 2023-06-24 14:15:22,164 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 2.712e+02 3.098e+02 3.534e+02 5.991e+02, threshold=6.196e+02, percent-clipped=1.0 2023-06-24 14:15:22,196 INFO [train.py:996] (2/4) Epoch 7, batch 3200, loss[loss=0.22, simple_loss=0.3144, pruned_loss=0.06282, over 21724.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3065, pruned_loss=0.07566, over 4271138.36 frames. ], batch size: 389, lr: 4.45e-03, grad_scale: 32.0 2023-06-24 14:15:24,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1117002.0, ans=0.0 2023-06-24 14:15:27,387 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.85 vs. limit=10.0 2023-06-24 14:15:33,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1117002.0, ans=0.0 2023-06-24 14:16:19,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1117122.0, ans=0.2 2023-06-24 14:17:13,203 INFO [train.py:996] (2/4) Epoch 7, batch 3250, loss[loss=0.2185, simple_loss=0.2865, pruned_loss=0.07527, over 21748.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.307, pruned_loss=0.07741, over 4266841.42 frames. ], batch size: 124, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:18:45,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1117482.0, ans=15.0 2023-06-24 14:19:05,843 INFO [train.py:996] (2/4) Epoch 7, batch 3300, loss[loss=0.2087, simple_loss=0.3096, pruned_loss=0.05389, over 21677.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.301, pruned_loss=0.07604, over 4274310.91 frames. ], batch size: 298, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:19:07,518 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.143e+02 2.718e+02 3.384e+02 4.609e+02 8.476e+02, threshold=6.767e+02, percent-clipped=13.0 2023-06-24 14:19:34,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1117662.0, ans=0.125 2023-06-24 14:19:51,802 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.04 vs. 
limit=15.0 2023-06-24 14:20:28,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1117782.0, ans=0.125 2023-06-24 14:20:28,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1117782.0, ans=0.1 2023-06-24 14:20:46,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1117842.0, ans=0.125 2023-06-24 14:20:52,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1117842.0, ans=0.04949747468305833 2023-06-24 14:20:56,361 INFO [train.py:996] (2/4) Epoch 7, batch 3350, loss[loss=0.2042, simple_loss=0.2974, pruned_loss=0.05547, over 21305.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3035, pruned_loss=0.07594, over 4280557.15 frames. ], batch size: 211, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:22:18,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1118082.0, ans=0.125 2023-06-24 14:22:27,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1118082.0, ans=0.2 2023-06-24 14:22:29,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1118142.0, ans=0.2 2023-06-24 14:22:53,117 INFO [train.py:996] (2/4) Epoch 7, batch 3400, loss[loss=0.2208, simple_loss=0.3084, pruned_loss=0.06666, over 21341.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3038, pruned_loss=0.0768, over 4287322.10 frames. ], batch size: 176, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:22:54,755 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 2.810e+02 3.179e+02 3.983e+02 5.568e+02, threshold=6.357e+02, percent-clipped=0.0 2023-06-24 14:23:09,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1118262.0, ans=0.1 2023-06-24 14:24:38,607 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:24:43,511 INFO [train.py:996] (2/4) Epoch 7, batch 3450, loss[loss=0.2203, simple_loss=0.3181, pruned_loss=0.06121, over 20860.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2996, pruned_loss=0.07623, over 4285522.19 frames. ], batch size: 607, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:25:00,009 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.46 vs. limit=15.0 2023-06-24 14:25:48,981 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=15.0 2023-06-24 14:25:57,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1118682.0, ans=0.125 2023-06-24 14:26:22,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1118742.0, ans=0.035 2023-06-24 14:26:36,942 INFO [train.py:996] (2/4) Epoch 7, batch 3500, loss[loss=0.2598, simple_loss=0.3413, pruned_loss=0.08916, over 21707.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3103, pruned_loss=0.07978, over 4284543.31 frames. 
], batch size: 298, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:26:38,659 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 2.694e+02 2.966e+02 3.710e+02 5.580e+02, threshold=5.932e+02, percent-clipped=0.0 2023-06-24 14:27:11,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1118862.0, ans=0.125 2023-06-24 14:28:21,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1119042.0, ans=0.125 2023-06-24 14:28:33,485 INFO [train.py:996] (2/4) Epoch 7, batch 3550, loss[loss=0.2252, simple_loss=0.3004, pruned_loss=0.07498, over 21739.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3124, pruned_loss=0.08051, over 4289763.02 frames. ], batch size: 351, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:29:18,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1119222.0, ans=0.0 2023-06-24 14:29:21,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1119222.0, ans=0.125 2023-06-24 14:29:27,698 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.96 vs. limit=15.0 2023-06-24 14:30:01,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1119342.0, ans=0.2 2023-06-24 14:30:21,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1119342.0, ans=0.125 2023-06-24 14:30:24,495 INFO [train.py:996] (2/4) Epoch 7, batch 3600, loss[loss=0.2489, simple_loss=0.3142, pruned_loss=0.0918, over 21843.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3067, pruned_loss=0.08002, over 4287534.56 frames. ], batch size: 118, lr: 4.44e-03, grad_scale: 32.0 2023-06-24 14:30:31,401 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.346e+02 2.873e+02 3.282e+02 3.993e+02 6.971e+02, threshold=6.565e+02, percent-clipped=2.0 2023-06-24 14:30:48,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1119462.0, ans=0.125 2023-06-24 14:31:13,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1119522.0, ans=0.0 2023-06-24 14:31:19,375 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-24 14:31:51,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1119642.0, ans=0.125 2023-06-24 14:32:22,685 INFO [train.py:996] (2/4) Epoch 7, batch 3650, loss[loss=0.2203, simple_loss=0.2893, pruned_loss=0.07563, over 21762.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3082, pruned_loss=0.08014, over 4285165.47 frames. 
], batch size: 124, lr: 4.44e-03, grad_scale: 32.0 2023-06-24 14:32:48,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1119762.0, ans=0.0 2023-06-24 14:33:06,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1119822.0, ans=0.0 2023-06-24 14:33:22,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1119882.0, ans=0.1 2023-06-24 14:34:06,794 INFO [train.py:996] (2/4) Epoch 7, batch 3700, loss[loss=0.2264, simple_loss=0.3079, pruned_loss=0.07246, over 21869.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3083, pruned_loss=0.08, over 4286422.88 frames. ], batch size: 118, lr: 4.44e-03, grad_scale: 32.0 2023-06-24 14:34:08,352 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 2.822e+02 3.276e+02 3.785e+02 7.589e+02, threshold=6.551e+02, percent-clipped=1.0 2023-06-24 14:34:17,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1120002.0, ans=0.125 2023-06-24 14:34:48,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1120062.0, ans=0.0 2023-06-24 14:35:02,632 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.22 vs. limit=15.0 2023-06-24 14:36:02,189 INFO [train.py:996] (2/4) Epoch 7, batch 3750, loss[loss=0.1953, simple_loss=0.2831, pruned_loss=0.05375, over 21019.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3058, pruned_loss=0.07853, over 4290310.86 frames. ], batch size: 608, lr: 4.44e-03, grad_scale: 32.0 2023-06-24 14:37:02,119 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.87 vs. limit=10.0 2023-06-24 14:37:57,856 INFO [train.py:996] (2/4) Epoch 7, batch 3800, loss[loss=0.2598, simple_loss=0.3352, pruned_loss=0.09216, over 21571.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3044, pruned_loss=0.07751, over 4282703.43 frames. ], batch size: 415, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:38:01,823 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.713e+02 3.064e+02 3.470e+02 5.470e+02, threshold=6.128e+02, percent-clipped=0.0 2023-06-24 14:38:25,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1120662.0, ans=0.125 2023-06-24 14:38:54,899 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-24 14:39:14,634 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.84 vs. limit=10.0 2023-06-24 14:39:35,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1120842.0, ans=0.125 2023-06-24 14:39:49,651 INFO [train.py:996] (2/4) Epoch 7, batch 3850, loss[loss=0.1995, simple_loss=0.2559, pruned_loss=0.07151, over 21149.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3018, pruned_loss=0.07783, over 4275949.78 frames. 
], batch size: 143, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:40:08,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1120962.0, ans=0.1 2023-06-24 14:41:33,235 INFO [train.py:996] (2/4) Epoch 7, batch 3900, loss[loss=0.2312, simple_loss=0.3004, pruned_loss=0.08095, over 21635.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.297, pruned_loss=0.0773, over 4273635.32 frames. ], batch size: 389, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:41:36,446 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.175e+02 2.710e+02 3.145e+02 3.584e+02 6.226e+02, threshold=6.291e+02, percent-clipped=1.0 2023-06-24 14:41:46,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1121202.0, ans=0.125 2023-06-24 14:41:55,788 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.29 vs. limit=15.0 2023-06-24 14:41:58,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1121262.0, ans=0.2 2023-06-24 14:43:23,673 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.49 vs. limit=15.0 2023-06-24 14:43:31,392 INFO [train.py:996] (2/4) Epoch 7, batch 3950, loss[loss=0.1876, simple_loss=0.2681, pruned_loss=0.05353, over 21342.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2989, pruned_loss=0.07653, over 4275116.90 frames. ], batch size: 131, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:44:17,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1121622.0, ans=0.0 2023-06-24 14:44:28,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1121622.0, ans=0.2 2023-06-24 14:44:39,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1121682.0, ans=0.2 2023-06-24 14:45:03,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1121742.0, ans=0.125 2023-06-24 14:45:18,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1121742.0, ans=0.125 2023-06-24 14:45:22,802 INFO [train.py:996] (2/4) Epoch 7, batch 4000, loss[loss=0.1925, simple_loss=0.2587, pruned_loss=0.06309, over 21501.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.293, pruned_loss=0.07292, over 4275991.60 frames. 
], batch size: 230, lr: 4.44e-03, grad_scale: 32.0 2023-06-24 14:45:26,593 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.561e+02 2.888e+02 3.482e+02 6.063e+02, threshold=5.775e+02, percent-clipped=0.0 2023-06-24 14:45:43,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1121802.0, ans=0.125 2023-06-24 14:46:41,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1121982.0, ans=0.125 2023-06-24 14:46:51,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1122042.0, ans=0.025 2023-06-24 14:47:13,468 INFO [train.py:996] (2/4) Epoch 7, batch 4050, loss[loss=0.2266, simple_loss=0.2962, pruned_loss=0.07845, over 21167.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2924, pruned_loss=0.07122, over 4270566.95 frames. ], batch size: 608, lr: 4.44e-03, grad_scale: 32.0 2023-06-24 14:47:22,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1122102.0, ans=0.0 2023-06-24 14:47:41,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1122162.0, ans=0.1 2023-06-24 14:47:49,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1122162.0, ans=0.125 2023-06-24 14:47:59,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1122222.0, ans=0.2 2023-06-24 14:49:04,264 INFO [train.py:996] (2/4) Epoch 7, batch 4100, loss[loss=0.2138, simple_loss=0.2871, pruned_loss=0.07028, over 21643.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.293, pruned_loss=0.07144, over 4277859.98 frames. ], batch size: 230, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:49:08,897 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.546e+02 2.998e+02 3.545e+02 8.551e+02, threshold=5.997e+02, percent-clipped=3.0 2023-06-24 14:49:41,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1122462.0, ans=0.0 2023-06-24 14:50:20,011 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.10 vs. limit=15.0 2023-06-24 14:50:41,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1122642.0, ans=0.125 2023-06-24 14:50:54,051 INFO [train.py:996] (2/4) Epoch 7, batch 4150, loss[loss=0.2042, simple_loss=0.2897, pruned_loss=0.05939, over 21757.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2931, pruned_loss=0.06852, over 4278495.15 frames. ], batch size: 316, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:51:11,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1122702.0, ans=0.05 2023-06-24 14:51:19,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1122762.0, ans=0.2 2023-06-24 14:51:39,890 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.54 vs. 
limit=12.0 2023-06-24 14:51:56,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1122822.0, ans=0.2 2023-06-24 14:52:46,870 INFO [train.py:996] (2/4) Epoch 7, batch 4200, loss[loss=0.2263, simple_loss=0.2999, pruned_loss=0.07637, over 21529.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2922, pruned_loss=0.06847, over 4270355.91 frames. ], batch size: 389, lr: 4.43e-03, grad_scale: 16.0 2023-06-24 14:52:57,888 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 2.672e+02 2.976e+02 3.504e+02 5.360e+02, threshold=5.952e+02, percent-clipped=0.0 2023-06-24 14:54:05,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1123182.0, ans=0.125 2023-06-24 14:54:45,227 INFO [train.py:996] (2/4) Epoch 7, batch 4250, loss[loss=0.2672, simple_loss=0.3367, pruned_loss=0.09886, over 21489.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2964, pruned_loss=0.07029, over 4264282.59 frames. ], batch size: 194, lr: 4.43e-03, grad_scale: 16.0 2023-06-24 14:54:59,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1123302.0, ans=0.125 2023-06-24 14:55:41,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1123422.0, ans=0.125 2023-06-24 14:56:02,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1123482.0, ans=0.0 2023-06-24 14:56:21,010 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=12.0 2023-06-24 14:56:42,757 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:56:43,630 INFO [train.py:996] (2/4) Epoch 7, batch 4300, loss[loss=0.1917, simple_loss=0.2392, pruned_loss=0.07213, over 20760.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3029, pruned_loss=0.07256, over 4266202.73 frames. 
], batch size: 609, lr: 4.43e-03, grad_scale: 16.0 2023-06-24 14:56:47,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1123602.0, ans=0.125 2023-06-24 14:56:48,732 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 3.063e+02 3.693e+02 4.827e+02 7.345e+02, threshold=7.385e+02, percent-clipped=7.0 2023-06-24 14:56:58,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1123602.0, ans=0.0 2023-06-24 14:57:05,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1123662.0, ans=0.2 2023-06-24 14:57:33,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1123722.0, ans=0.0 2023-06-24 14:57:48,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1123782.0, ans=0.0 2023-06-24 14:58:04,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1123782.0, ans=0.125 2023-06-24 14:58:04,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1123782.0, ans=0.2 2023-06-24 14:58:39,573 INFO [train.py:996] (2/4) Epoch 7, batch 4350, loss[loss=0.2326, simple_loss=0.312, pruned_loss=0.07662, over 21597.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3009, pruned_loss=0.07154, over 4266484.48 frames. ], batch size: 414, lr: 4.43e-03, grad_scale: 16.0 2023-06-24 14:58:43,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1123902.0, ans=0.0 2023-06-24 14:59:09,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1123962.0, ans=0.0 2023-06-24 14:59:25,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1124022.0, ans=0.0 2023-06-24 14:59:59,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1124142.0, ans=0.125 2023-06-24 15:00:27,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1124142.0, ans=0.125 2023-06-24 15:00:35,516 INFO [train.py:996] (2/4) Epoch 7, batch 4400, loss[loss=0.1969, simple_loss=0.2739, pruned_loss=0.05996, over 21786.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2968, pruned_loss=0.07159, over 4274572.33 frames. 
], batch size: 112, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:00:40,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1124202.0, ans=0.125 2023-06-24 15:00:41,292 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 2.891e+02 3.329e+02 4.006e+02 7.259e+02, threshold=6.659e+02, percent-clipped=0.0 2023-06-24 15:01:32,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1124322.0, ans=0.1 2023-06-24 15:01:40,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1124382.0, ans=0.0 2023-06-24 15:01:56,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1124382.0, ans=0.0 2023-06-24 15:02:28,256 INFO [train.py:996] (2/4) Epoch 7, batch 4450, loss[loss=0.2313, simple_loss=0.3085, pruned_loss=0.07701, over 21434.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3061, pruned_loss=0.074, over 4279945.38 frames. ], batch size: 194, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:03:12,169 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.77 vs. limit=10.0 2023-06-24 15:03:40,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1124682.0, ans=0.125 2023-06-24 15:03:43,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1124682.0, ans=0.0 2023-06-24 15:04:20,242 INFO [train.py:996] (2/4) Epoch 7, batch 4500, loss[loss=0.2299, simple_loss=0.3019, pruned_loss=0.07894, over 21415.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3067, pruned_loss=0.07507, over 4287804.95 frames. ], batch size: 144, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:04:25,146 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 2.933e+02 3.595e+02 4.328e+02 6.220e+02, threshold=7.189e+02, percent-clipped=0.0 2023-06-24 15:05:47,365 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-06-24 15:06:10,772 INFO [train.py:996] (2/4) Epoch 7, batch 4550, loss[loss=0.246, simple_loss=0.3279, pruned_loss=0.08199, over 21742.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3106, pruned_loss=0.07558, over 4285025.59 frames. ], batch size: 332, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:06:22,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1125102.0, ans=0.125 2023-06-24 15:06:25,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1125102.0, ans=0.0 2023-06-24 15:07:27,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1125282.0, ans=0.0 2023-06-24 15:07:56,748 INFO [train.py:996] (2/4) Epoch 7, batch 4600, loss[loss=0.2274, simple_loss=0.3097, pruned_loss=0.0725, over 21856.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3114, pruned_loss=0.07593, over 4279194.81 frames. 
], batch size: 124, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:07:57,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1125402.0, ans=0.125 2023-06-24 15:08:02,272 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.231e+02 3.095e+02 3.765e+02 5.007e+02 9.113e+02, threshold=7.530e+02, percent-clipped=6.0 2023-06-24 15:08:22,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1125462.0, ans=0.1 2023-06-24 15:08:36,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1125522.0, ans=0.125 2023-06-24 15:09:15,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1125582.0, ans=0.0 2023-06-24 15:09:22,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1125582.0, ans=0.1 2023-06-24 15:09:41,449 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-06-24 15:09:44,653 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=15.0 2023-06-24 15:09:45,802 INFO [train.py:996] (2/4) Epoch 7, batch 4650, loss[loss=0.236, simple_loss=0.299, pruned_loss=0.08649, over 21575.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3065, pruned_loss=0.07469, over 4280844.85 frames. ], batch size: 471, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:10:11,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1125762.0, ans=0.07 2023-06-24 15:10:44,687 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.62 vs. limit=15.0 2023-06-24 15:11:35,510 INFO [train.py:996] (2/4) Epoch 7, batch 4700, loss[loss=0.2078, simple_loss=0.2703, pruned_loss=0.0727, over 21572.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2975, pruned_loss=0.0726, over 4275576.93 frames. ], batch size: 391, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:11:45,722 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.572e+02 2.876e+02 3.232e+02 6.204e+02, threshold=5.752e+02, percent-clipped=0.0 2023-06-24 15:11:46,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1126002.0, ans=0.0 2023-06-24 15:12:28,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1126122.0, ans=0.125 2023-06-24 15:12:30,580 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. limit=6.0 2023-06-24 15:12:55,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1126182.0, ans=0.125 2023-06-24 15:13:17,053 INFO [train.py:996] (2/4) Epoch 7, batch 4750, loss[loss=0.2301, simple_loss=0.2972, pruned_loss=0.08155, over 21335.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2933, pruned_loss=0.07286, over 4279486.96 frames. 
], batch size: 143, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:13:36,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1126302.0, ans=0.125 2023-06-24 15:13:50,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1126362.0, ans=0.2 2023-06-24 15:13:50,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1126362.0, ans=0.125 2023-06-24 15:14:31,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1126482.0, ans=0.125 2023-06-24 15:14:58,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1126542.0, ans=0.2 2023-06-24 15:15:13,781 INFO [train.py:996] (2/4) Epoch 7, batch 4800, loss[loss=0.2097, simple_loss=0.3139, pruned_loss=0.05276, over 21832.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2938, pruned_loss=0.07333, over 4279988.20 frames. ], batch size: 316, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:15:19,154 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.780e+02 3.342e+02 3.933e+02 6.055e+02, threshold=6.684e+02, percent-clipped=1.0 2023-06-24 15:16:59,131 INFO [train.py:996] (2/4) Epoch 7, batch 4850, loss[loss=0.2125, simple_loss=0.2793, pruned_loss=0.07286, over 21779.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2939, pruned_loss=0.07301, over 4282428.23 frames. ], batch size: 247, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:17:10,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1126902.0, ans=0.1 2023-06-24 15:17:15,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1126962.0, ans=0.125 2023-06-24 15:17:51,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1127022.0, ans=0.2 2023-06-24 15:18:10,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1127082.0, ans=0.05 2023-06-24 15:18:15,681 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.49 vs. limit=22.5 2023-06-24 15:18:42,417 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.61 vs. limit=15.0 2023-06-24 15:18:50,518 INFO [train.py:996] (2/4) Epoch 7, batch 4900, loss[loss=0.227, simple_loss=0.3277, pruned_loss=0.06316, over 21816.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2976, pruned_loss=0.07343, over 4285363.09 frames. 
], batch size: 316, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:18:51,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1127202.0, ans=0.125 2023-06-24 15:18:55,342 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.943e+02 2.644e+02 3.017e+02 3.473e+02 6.026e+02, threshold=6.033e+02, percent-clipped=0.0 2023-06-24 15:19:47,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1127322.0, ans=0.125 2023-06-24 15:20:01,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1127382.0, ans=0.0 2023-06-24 15:20:29,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1127442.0, ans=0.0 2023-06-24 15:20:41,560 INFO [train.py:996] (2/4) Epoch 7, batch 4950, loss[loss=0.1891, simple_loss=0.2834, pruned_loss=0.04746, over 21238.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3022, pruned_loss=0.07204, over 4286578.47 frames. ], batch size: 176, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:21:00,053 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=12.0 2023-06-24 15:21:32,206 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.11 vs. limit=10.0 2023-06-24 15:21:59,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1127682.0, ans=0.035 2023-06-24 15:22:14,036 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-24 15:22:30,516 INFO [train.py:996] (2/4) Epoch 7, batch 5000, loss[loss=0.221, simple_loss=0.2966, pruned_loss=0.0727, over 21875.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.3007, pruned_loss=0.06942, over 4282842.78 frames. ], batch size: 371, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:22:35,396 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.800e+02 2.510e+02 2.912e+02 3.367e+02 5.959e+02, threshold=5.824e+02, percent-clipped=0.0 2023-06-24 15:22:48,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1127862.0, ans=0.1 2023-06-24 15:23:07,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1127862.0, ans=0.0 2023-06-24 15:23:17,028 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.36 vs. limit=15.0 2023-06-24 15:23:50,984 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.73 vs. limit=15.0 2023-06-24 15:23:51,254 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. 
limit=6.0 2023-06-24 15:23:57,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1128042.0, ans=0.1 2023-06-24 15:24:01,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1128042.0, ans=0.0 2023-06-24 15:24:19,962 INFO [train.py:996] (2/4) Epoch 7, batch 5050, loss[loss=0.2372, simple_loss=0.3108, pruned_loss=0.08182, over 21778.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3008, pruned_loss=0.07123, over 4283659.35 frames. ], batch size: 112, lr: 4.42e-03, grad_scale: 32.0 2023-06-24 15:24:25,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1128102.0, ans=0.0 2023-06-24 15:24:56,932 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 15:26:10,474 INFO [train.py:996] (2/4) Epoch 7, batch 5100, loss[loss=0.2227, simple_loss=0.2984, pruned_loss=0.07352, over 21734.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.3, pruned_loss=0.07173, over 4290520.35 frames. ], batch size: 389, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:26:17,209 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 2.691e+02 3.129e+02 3.589e+02 6.328e+02, threshold=6.257e+02, percent-clipped=2.0 2023-06-24 15:27:06,035 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.61 vs. limit=15.0 2023-06-24 15:28:00,596 INFO [train.py:996] (2/4) Epoch 7, batch 5150, loss[loss=0.3022, simple_loss=0.366, pruned_loss=0.1192, over 21625.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2989, pruned_loss=0.07293, over 4291428.19 frames. ], batch size: 508, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:28:33,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1128762.0, ans=0.2 2023-06-24 15:29:29,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1128942.0, ans=0.125 2023-06-24 15:29:52,288 INFO [train.py:996] (2/4) Epoch 7, batch 5200, loss[loss=0.2397, simple_loss=0.3333, pruned_loss=0.0731, over 21753.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3016, pruned_loss=0.07348, over 4281621.97 frames. ], batch size: 247, lr: 4.42e-03, grad_scale: 32.0 2023-06-24 15:29:59,475 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.122e+02 2.769e+02 3.246e+02 4.133e+02 8.749e+02, threshold=6.492e+02, percent-clipped=7.0 2023-06-24 15:30:08,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1129062.0, ans=0.125 2023-06-24 15:31:41,120 INFO [train.py:996] (2/4) Epoch 7, batch 5250, loss[loss=0.2175, simple_loss=0.3002, pruned_loss=0.06742, over 21853.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3052, pruned_loss=0.07199, over 4286313.72 frames. 
], batch size: 316, lr: 4.42e-03, grad_scale: 32.0 2023-06-24 15:31:50,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1129302.0, ans=0.0 2023-06-24 15:31:55,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1129302.0, ans=0.125 2023-06-24 15:32:05,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1129362.0, ans=0.025 2023-06-24 15:32:17,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1129362.0, ans=0.1 2023-06-24 15:32:32,389 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-06-24 15:33:08,180 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=12.0 2023-06-24 15:33:31,873 INFO [train.py:996] (2/4) Epoch 7, batch 5300, loss[loss=0.2167, simple_loss=0.2818, pruned_loss=0.07579, over 21786.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3048, pruned_loss=0.07275, over 4282144.87 frames. ], batch size: 247, lr: 4.42e-03, grad_scale: 32.0 2023-06-24 15:33:38,419 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.522e+02 2.825e+02 3.420e+02 5.349e+02, threshold=5.650e+02, percent-clipped=0.0 2023-06-24 15:33:53,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1129662.0, ans=0.0 2023-06-24 15:34:01,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1129662.0, ans=0.125 2023-06-24 15:34:29,372 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.81 vs. limit=6.0 2023-06-24 15:34:36,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1129782.0, ans=0.0 2023-06-24 15:34:36,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1129782.0, ans=0.125 2023-06-24 15:35:03,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1129842.0, ans=0.05 2023-06-24 15:35:17,641 INFO [train.py:996] (2/4) Epoch 7, batch 5350, loss[loss=0.2429, simple_loss=0.3094, pruned_loss=0.08821, over 21519.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3038, pruned_loss=0.07441, over 4281567.05 frames. ], batch size: 131, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:36:16,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1130022.0, ans=0.2 2023-06-24 15:36:40,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1130082.0, ans=0.125 2023-06-24 15:36:59,417 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.78 vs. 
limit=22.5 2023-06-24 15:37:07,138 INFO [train.py:996] (2/4) Epoch 7, batch 5400, loss[loss=0.2128, simple_loss=0.295, pruned_loss=0.06527, over 21890.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3014, pruned_loss=0.0751, over 4286424.61 frames. ], batch size: 124, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:37:16,433 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.105e+02 2.748e+02 3.020e+02 3.535e+02 6.573e+02, threshold=6.041e+02, percent-clipped=2.0 2023-06-24 15:37:36,111 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.08 vs. limit=15.0 2023-06-24 15:38:19,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1130382.0, ans=0.125 2023-06-24 15:38:37,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1130442.0, ans=0.125 2023-06-24 15:38:59,043 INFO [train.py:996] (2/4) Epoch 7, batch 5450, loss[loss=0.2105, simple_loss=0.2937, pruned_loss=0.06371, over 21326.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3021, pruned_loss=0.07392, over 4284953.61 frames. ], batch size: 194, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:39:04,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1130502.0, ans=0.5 2023-06-24 15:39:38,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1130622.0, ans=0.125 2023-06-24 15:40:01,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1130682.0, ans=0.07 2023-06-24 15:40:16,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1130682.0, ans=0.125 2023-06-24 15:40:36,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1130742.0, ans=15.0 2023-06-24 15:40:50,130 INFO [train.py:996] (2/4) Epoch 7, batch 5500, loss[loss=0.1938, simple_loss=0.2785, pruned_loss=0.05461, over 21244.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3068, pruned_loss=0.07101, over 4281690.47 frames. ], batch size: 159, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:40:51,429 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.93 vs. limit=15.0 2023-06-24 15:40:55,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1130802.0, ans=0.2 2023-06-24 15:40:58,245 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.932e+02 2.852e+02 3.783e+02 5.353e+02 8.274e+02, threshold=7.565e+02, percent-clipped=13.0 2023-06-24 15:41:08,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1130862.0, ans=0.125 2023-06-24 15:42:38,482 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-24 15:42:40,381 INFO [train.py:996] (2/4) Epoch 7, batch 5550, loss[loss=0.2267, simple_loss=0.2891, pruned_loss=0.08214, over 21561.00 frames. 
], tot_loss[loss=0.2204, simple_loss=0.3044, pruned_loss=0.06815, over 4277338.95 frames. ], batch size: 548, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:42:51,221 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=15.0 2023-06-24 15:42:52,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1131102.0, ans=0.2 2023-06-24 15:43:41,080 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.38 vs. limit=22.5 2023-06-24 15:43:55,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1131282.0, ans=0.0 2023-06-24 15:44:28,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1131342.0, ans=0.1 2023-06-24 15:44:31,792 INFO [train.py:996] (2/4) Epoch 7, batch 5600, loss[loss=0.378, simple_loss=0.4461, pruned_loss=0.155, over 21438.00 frames. ], tot_loss[loss=0.218, simple_loss=0.303, pruned_loss=0.0665, over 4275077.99 frames. ], batch size: 507, lr: 4.42e-03, grad_scale: 32.0 2023-06-24 15:44:43,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1131402.0, ans=0.125 2023-06-24 15:44:45,773 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.709e+02 2.480e+02 2.959e+02 3.871e+02 8.894e+02, threshold=5.918e+02, percent-clipped=1.0 2023-06-24 15:44:48,594 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.23 vs. limit=6.0 2023-06-24 15:45:35,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1131522.0, ans=0.95 2023-06-24 15:46:19,847 INFO [train.py:996] (2/4) Epoch 7, batch 5650, loss[loss=0.2151, simple_loss=0.2928, pruned_loss=0.06866, over 21812.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3048, pruned_loss=0.06807, over 4276909.82 frames. ], batch size: 282, lr: 4.42e-03, grad_scale: 32.0 2023-06-24 15:46:28,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1131702.0, ans=0.07 2023-06-24 15:46:29,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1131702.0, ans=0.125 2023-06-24 15:46:32,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1131702.0, ans=0.2 2023-06-24 15:47:47,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1131882.0, ans=0.125 2023-06-24 15:47:58,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1131942.0, ans=0.0 2023-06-24 15:48:00,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1131942.0, ans=0.125 2023-06-24 15:48:15,353 INFO [train.py:996] (2/4) Epoch 7, batch 5700, loss[loss=0.1941, simple_loss=0.2759, pruned_loss=0.05619, over 21516.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.3034, pruned_loss=0.06875, over 4274461.26 frames. 
], batch size: 131, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:48:23,884 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.16 vs. limit=15.0 2023-06-24 15:48:26,250 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.045e+02 2.629e+02 3.066e+02 3.731e+02 7.827e+02, threshold=6.133e+02, percent-clipped=4.0 2023-06-24 15:48:44,651 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=15.0 2023-06-24 15:49:21,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1132122.0, ans=0.0 2023-06-24 15:49:26,090 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.94 vs. limit=15.0 2023-06-24 15:49:56,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1132242.0, ans=0.0 2023-06-24 15:50:06,429 INFO [train.py:996] (2/4) Epoch 7, batch 5750, loss[loss=0.1823, simple_loss=0.2871, pruned_loss=0.03869, over 21179.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2996, pruned_loss=0.06612, over 4269349.10 frames. ], batch size: 548, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:50:07,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1132302.0, ans=0.125 2023-06-24 15:50:10,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1132302.0, ans=0.0 2023-06-24 15:50:28,878 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.31 vs. limit=22.5 2023-06-24 15:50:36,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1132362.0, ans=0.125 2023-06-24 15:51:11,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1132422.0, ans=0.0 2023-06-24 15:51:16,129 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.37 vs. limit=15.0 2023-06-24 15:51:46,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1132542.0, ans=0.1 2023-06-24 15:51:51,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1132542.0, ans=0.0 2023-06-24 15:51:56,230 INFO [train.py:996] (2/4) Epoch 7, batch 5800, loss[loss=0.2024, simple_loss=0.291, pruned_loss=0.05689, over 21600.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2998, pruned_loss=0.06506, over 4276575.32 frames. 
], batch size: 230, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:52:12,050 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.870e+02 2.681e+02 3.323e+02 4.302e+02 6.884e+02, threshold=6.646e+02, percent-clipped=1.0 2023-06-24 15:52:48,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1132722.0, ans=0.1 2023-06-24 15:52:50,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1132722.0, ans=0.125 2023-06-24 15:53:44,710 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.23 vs. limit=15.0 2023-06-24 15:53:58,480 INFO [train.py:996] (2/4) Epoch 7, batch 5850, loss[loss=0.1693, simple_loss=0.2712, pruned_loss=0.03371, over 21803.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2982, pruned_loss=0.06166, over 4280254.83 frames. ], batch size: 316, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:54:23,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1132962.0, ans=0.0 2023-06-24 15:55:46,739 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.27 vs. limit=15.0 2023-06-24 15:55:51,571 INFO [train.py:996] (2/4) Epoch 7, batch 5900, loss[loss=0.1722, simple_loss=0.2597, pruned_loss=0.04241, over 21759.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2934, pruned_loss=0.05853, over 4281902.09 frames. ], batch size: 298, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 15:55:58,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1133202.0, ans=0.125 2023-06-24 15:56:01,772 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 2.024e+02 2.372e+02 2.933e+02 6.586e+02, threshold=4.744e+02, percent-clipped=0.0 2023-06-24 15:56:13,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1133262.0, ans=0.125 2023-06-24 15:56:59,653 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.12 vs. limit=10.0 2023-06-24 15:57:16,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1133442.0, ans=0.125 2023-06-24 15:57:37,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1133442.0, ans=0.125 2023-06-24 15:57:39,660 INFO [train.py:996] (2/4) Epoch 7, batch 5950, loss[loss=0.2273, simple_loss=0.2923, pruned_loss=0.08114, over 21920.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2918, pruned_loss=0.06148, over 4286994.51 frames. ], batch size: 316, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 15:57:44,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1133502.0, ans=0.125 2023-06-24 15:59:01,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1133742.0, ans=0.125 2023-06-24 15:59:27,103 INFO [train.py:996] (2/4) Epoch 7, batch 6000, loss[loss=0.2056, simple_loss=0.2652, pruned_loss=0.07303, over 21764.00 frames. 
], tot_loss[loss=0.2082, simple_loss=0.2895, pruned_loss=0.06347, over 4284694.98 frames. ], batch size: 351, lr: 4.41e-03, grad_scale: 32.0 2023-06-24 15:59:27,103 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 15:59:44,457 INFO [train.py:1028] (2/4) Epoch 7, validation: loss=0.2613, simple_loss=0.3539, pruned_loss=0.08436, over 1796401.00 frames. 2023-06-24 15:59:44,458 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-24 15:59:57,242 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 3.144e+02 3.731e+02 4.665e+02 6.977e+02, threshold=7.462e+02, percent-clipped=24.0 2023-06-24 16:00:23,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1133862.0, ans=0.125 2023-06-24 16:00:31,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1133922.0, ans=0.125 2023-06-24 16:00:49,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1133982.0, ans=0.0 2023-06-24 16:01:28,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1134042.0, ans=0.125 2023-06-24 16:01:36,674 INFO [train.py:996] (2/4) Epoch 7, batch 6050, loss[loss=0.1931, simple_loss=0.2537, pruned_loss=0.06626, over 21628.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2851, pruned_loss=0.06481, over 4280910.55 frames. ], batch size: 231, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:01:59,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1134162.0, ans=0.125 2023-06-24 16:02:01,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1134162.0, ans=0.09899494936611666 2023-06-24 16:03:27,722 INFO [train.py:996] (2/4) Epoch 7, batch 6100, loss[loss=0.2015, simple_loss=0.2826, pruned_loss=0.06024, over 21311.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2814, pruned_loss=0.06327, over 4280864.27 frames. ], batch size: 176, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:03:39,888 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.866e+02 2.425e+02 2.947e+02 3.693e+02 6.413e+02, threshold=5.895e+02, percent-clipped=0.0 2023-06-24 16:05:16,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1134702.0, ans=0.1 2023-06-24 16:05:17,155 INFO [train.py:996] (2/4) Epoch 7, batch 6150, loss[loss=0.2757, simple_loss=0.3286, pruned_loss=0.1115, over 21799.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2841, pruned_loss=0.06581, over 4284897.36 frames. 
], batch size: 507, lr: 4.41e-03, grad_scale: 8.0 2023-06-24 16:05:21,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1134702.0, ans=0.125 2023-06-24 16:05:23,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1134702.0, ans=0.1 2023-06-24 16:05:25,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1134702.0, ans=0.2 2023-06-24 16:05:47,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1134762.0, ans=0.0 2023-06-24 16:05:51,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1134762.0, ans=0.125 2023-06-24 16:06:08,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1134822.0, ans=0.125 2023-06-24 16:06:19,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1134882.0, ans=0.125 2023-06-24 16:07:05,671 INFO [train.py:996] (2/4) Epoch 7, batch 6200, loss[loss=0.2172, simple_loss=0.2947, pruned_loss=0.06987, over 21649.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2865, pruned_loss=0.06602, over 4282285.07 frames. ], batch size: 230, lr: 4.41e-03, grad_scale: 8.0 2023-06-24 16:07:25,609 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.569e+02 3.119e+02 3.567e+02 5.212e+02, threshold=6.237e+02, percent-clipped=0.0 2023-06-24 16:08:16,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1135182.0, ans=0.125 2023-06-24 16:08:31,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1135182.0, ans=0.1 2023-06-24 16:08:43,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1135242.0, ans=0.125 2023-06-24 16:08:56,841 INFO [train.py:996] (2/4) Epoch 7, batch 6250, loss[loss=0.2174, simple_loss=0.3111, pruned_loss=0.06187, over 21689.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2935, pruned_loss=0.06683, over 4281595.97 frames. ], batch size: 247, lr: 4.41e-03, grad_scale: 8.0 2023-06-24 16:09:13,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1135302.0, ans=0.0 2023-06-24 16:09:39,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1135422.0, ans=0.1 2023-06-24 16:09:47,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1135422.0, ans=0.125 2023-06-24 16:10:18,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1135482.0, ans=0.125 2023-06-24 16:10:31,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1135542.0, ans=0.125 2023-06-24 16:10:51,877 INFO [train.py:996] (2/4) Epoch 7, batch 6300, loss[loss=0.211, simple_loss=0.2925, pruned_loss=0.06471, over 21857.00 frames. 
], tot_loss[loss=0.215, simple_loss=0.2969, pruned_loss=0.06657, over 4280975.69 frames. ], batch size: 298, lr: 4.41e-03, grad_scale: 8.0 2023-06-24 16:11:06,079 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 2.617e+02 3.122e+02 4.088e+02 6.551e+02, threshold=6.244e+02, percent-clipped=1.0 2023-06-24 16:11:06,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1135602.0, ans=10.0 2023-06-24 16:11:13,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1135662.0, ans=0.1 2023-06-24 16:11:48,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1135722.0, ans=0.125 2023-06-24 16:11:51,035 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.74 vs. limit=10.0 2023-06-24 16:11:52,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1135722.0, ans=0.2 2023-06-24 16:12:40,767 INFO [train.py:996] (2/4) Epoch 7, batch 6350, loss[loss=0.2708, simple_loss=0.3346, pruned_loss=0.1035, over 21612.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2994, pruned_loss=0.0697, over 4280981.58 frames. ], batch size: 415, lr: 4.41e-03, grad_scale: 8.0 2023-06-24 16:12:41,337 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:12:41,915 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.90 vs. limit=6.0 2023-06-24 16:12:52,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1135902.0, ans=0.1 2023-06-24 16:13:04,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1135962.0, ans=0.125 2023-06-24 16:13:14,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1135962.0, ans=0.125 2023-06-24 16:13:16,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1135962.0, ans=0.2 2023-06-24 16:14:08,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1136082.0, ans=0.125 2023-06-24 16:14:13,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1136142.0, ans=0.2 2023-06-24 16:14:23,633 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=12.0 2023-06-24 16:14:27,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1136142.0, ans=0.07 2023-06-24 16:14:30,433 INFO [train.py:996] (2/4) Epoch 7, batch 6400, loss[loss=0.2607, simple_loss=0.3259, pruned_loss=0.09773, over 21377.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.304, pruned_loss=0.0729, over 4280430.65 frames. 
], batch size: 176, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:14:50,066 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.44 vs. limit=22.5 2023-06-24 16:14:55,409 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.966e+02 3.361e+02 3.840e+02 6.220e+02, threshold=6.721e+02, percent-clipped=0.0 2023-06-24 16:16:25,907 INFO [train.py:996] (2/4) Epoch 7, batch 6450, loss[loss=0.2686, simple_loss=0.3198, pruned_loss=0.1087, over 21368.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3076, pruned_loss=0.07324, over 4277880.54 frames. ], batch size: 507, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:16:43,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1136502.0, ans=0.1 2023-06-24 16:16:56,438 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=15.0 2023-06-24 16:17:15,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1136622.0, ans=0.0 2023-06-24 16:17:52,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1136742.0, ans=0.2 2023-06-24 16:18:14,957 INFO [train.py:996] (2/4) Epoch 7, batch 6500, loss[loss=0.1975, simple_loss=0.2903, pruned_loss=0.05237, over 21826.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3023, pruned_loss=0.07143, over 4271778.00 frames. ], batch size: 317, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:18:38,259 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.852e+02 3.600e+02 4.849e+02 8.797e+02, threshold=7.199e+02, percent-clipped=3.0 2023-06-24 16:18:54,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1136862.0, ans=0.0 2023-06-24 16:19:12,882 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.42 vs. limit=15.0 2023-06-24 16:19:35,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1136982.0, ans=0.125 2023-06-24 16:19:47,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1137042.0, ans=0.1 2023-06-24 16:20:03,506 INFO [train.py:996] (2/4) Epoch 7, batch 6550, loss[loss=0.2568, simple_loss=0.3209, pruned_loss=0.09637, over 21777.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3012, pruned_loss=0.07026, over 4279582.61 frames. ], batch size: 112, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:20:25,668 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.92 vs. 
limit=12.0 2023-06-24 16:20:35,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1137162.0, ans=0.125 2023-06-24 16:20:38,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1137162.0, ans=0.125 2023-06-24 16:20:43,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1137162.0, ans=0.05 2023-06-24 16:20:47,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1137222.0, ans=0.125 2023-06-24 16:21:15,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1137282.0, ans=0.1 2023-06-24 16:21:47,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1137342.0, ans=0.125 2023-06-24 16:21:53,198 INFO [train.py:996] (2/4) Epoch 7, batch 6600, loss[loss=0.2048, simple_loss=0.2687, pruned_loss=0.0705, over 21729.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2954, pruned_loss=0.06978, over 4259457.18 frames. ], batch size: 371, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:22:01,423 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.68 vs. limit=10.0 2023-06-24 16:22:05,387 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.74 vs. limit=15.0 2023-06-24 16:22:14,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1137402.0, ans=0.0 2023-06-24 16:22:17,169 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 2.518e+02 2.917e+02 3.263e+02 5.305e+02, threshold=5.833e+02, percent-clipped=0.0 2023-06-24 16:22:30,922 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.09 vs. limit=12.0 2023-06-24 16:23:02,431 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.31 vs. limit=15.0 2023-06-24 16:23:53,033 INFO [train.py:996] (2/4) Epoch 7, batch 6650, loss[loss=0.1863, simple_loss=0.2586, pruned_loss=0.05703, over 21810.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2866, pruned_loss=0.06699, over 4261718.46 frames. ], batch size: 352, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:23:57,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1137702.0, ans=0.0 2023-06-24 16:24:13,437 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.60 vs. 
limit=15.0 2023-06-24 16:24:14,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1137762.0, ans=0.04949747468305833 2023-06-24 16:24:33,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1137822.0, ans=0.125 2023-06-24 16:24:53,157 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-24 16:25:43,982 INFO [train.py:996] (2/4) Epoch 7, batch 6700, loss[loss=0.2055, simple_loss=0.2784, pruned_loss=0.06629, over 21691.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.281, pruned_loss=0.06695, over 4262107.90 frames. ], batch size: 282, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:25:57,332 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.457e+02 2.786e+02 3.230e+02 4.297e+02, threshold=5.572e+02, percent-clipped=0.0 2023-06-24 16:26:53,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1138182.0, ans=0.125 2023-06-24 16:26:54,134 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.09 vs. limit=15.0 2023-06-24 16:27:26,670 INFO [train.py:996] (2/4) Epoch 7, batch 6750, loss[loss=0.2214, simple_loss=0.2847, pruned_loss=0.07905, over 21405.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.279, pruned_loss=0.06758, over 4269701.54 frames. ], batch size: 194, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:27:50,403 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:28:09,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1138422.0, ans=0.125 2023-06-24 16:28:35,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1138482.0, ans=0.0 2023-06-24 16:28:38,185 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.67 vs. limit=15.0 2023-06-24 16:28:51,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1138542.0, ans=0.025 2023-06-24 16:29:09,318 INFO [train.py:996] (2/4) Epoch 7, batch 6800, loss[loss=0.2122, simple_loss=0.3415, pruned_loss=0.04145, over 19744.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2813, pruned_loss=0.06934, over 4267852.58 frames. 
], batch size: 702, lr: 4.40e-03, grad_scale: 32.0 2023-06-24 16:29:15,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1138602.0, ans=0.125 2023-06-24 16:29:23,224 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.205e+02 2.710e+02 3.194e+02 3.747e+02 5.784e+02, threshold=6.389e+02, percent-clipped=2.0 2023-06-24 16:29:25,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1138662.0, ans=0.0 2023-06-24 16:29:27,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1138662.0, ans=0.2 2023-06-24 16:29:57,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1138722.0, ans=0.0 2023-06-24 16:30:16,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1138782.0, ans=0.125 2023-06-24 16:30:51,554 INFO [train.py:996] (2/4) Epoch 7, batch 6850, loss[loss=0.1907, simple_loss=0.2564, pruned_loss=0.06244, over 21167.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2809, pruned_loss=0.07078, over 4276497.72 frames. ], batch size: 176, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:31:14,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1138962.0, ans=0.125 2023-06-24 16:31:34,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1139022.0, ans=0.125 2023-06-24 16:32:41,600 INFO [train.py:996] (2/4) Epoch 7, batch 6900, loss[loss=0.2127, simple_loss=0.3141, pruned_loss=0.0557, over 21528.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2839, pruned_loss=0.07031, over 4282355.39 frames. ], batch size: 471, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:33:03,122 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 2.809e+02 3.309e+02 4.065e+02 7.013e+02, threshold=6.619e+02, percent-clipped=1.0 2023-06-24 16:33:10,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1139262.0, ans=0.2 2023-06-24 16:33:28,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1139322.0, ans=0.125 2023-06-24 16:33:52,634 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=22.5 2023-06-24 16:33:59,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1139382.0, ans=0.1 2023-06-24 16:34:37,858 INFO [train.py:996] (2/4) Epoch 7, batch 6950, loss[loss=0.2623, simple_loss=0.3394, pruned_loss=0.09264, over 21469.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2856, pruned_loss=0.06801, over 4278641.24 frames. 
], batch size: 131, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:35:26,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1139622.0, ans=0.125 2023-06-24 16:35:28,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1139622.0, ans=0.0 2023-06-24 16:35:35,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1139682.0, ans=0.125 2023-06-24 16:36:15,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1139742.0, ans=0.0 2023-06-24 16:36:27,768 INFO [train.py:996] (2/4) Epoch 7, batch 7000, loss[loss=0.2343, simple_loss=0.3003, pruned_loss=0.08413, over 20803.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2892, pruned_loss=0.07027, over 4277649.10 frames. ], batch size: 608, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:36:36,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1139802.0, ans=0.125 2023-06-24 16:36:49,406 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.855e+02 3.392e+02 4.148e+02 6.941e+02, threshold=6.785e+02, percent-clipped=1.0 2023-06-24 16:37:17,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1139922.0, ans=15.0 2023-06-24 16:37:20,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1139922.0, ans=0.1 2023-06-24 16:37:38,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1139982.0, ans=0.125 2023-06-24 16:37:48,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1139982.0, ans=0.125 2023-06-24 16:38:13,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1140042.0, ans=0.125 2023-06-24 16:38:13,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1140042.0, ans=0.125 2023-06-24 16:38:18,590 INFO [train.py:996] (2/4) Epoch 7, batch 7050, loss[loss=0.2382, simple_loss=0.3259, pruned_loss=0.07522, over 21411.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2881, pruned_loss=0.07029, over 4278222.83 frames. ], batch size: 507, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:38:38,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1140102.0, ans=0.035 2023-06-24 16:38:49,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1140162.0, ans=0.125 2023-06-24 16:38:49,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1140162.0, ans=0.04949747468305833 2023-06-24 16:40:15,922 INFO [train.py:996] (2/4) Epoch 7, batch 7100, loss[loss=0.2535, simple_loss=0.3245, pruned_loss=0.09129, over 21395.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.292, pruned_loss=0.07173, over 4272463.73 frames. 
], batch size: 471, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:40:23,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1140402.0, ans=0.125 2023-06-24 16:40:31,717 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 2.792e+02 3.207e+02 3.771e+02 5.994e+02, threshold=6.414e+02, percent-clipped=0.0 2023-06-24 16:40:43,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1140462.0, ans=0.125 2023-06-24 16:41:08,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1140522.0, ans=0.125 2023-06-24 16:41:25,362 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=22.5 2023-06-24 16:42:06,514 INFO [train.py:996] (2/4) Epoch 7, batch 7150, loss[loss=0.2467, simple_loss=0.3324, pruned_loss=0.08052, over 21845.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2907, pruned_loss=0.0696, over 4272256.09 frames. ], batch size: 118, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:42:16,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1140702.0, ans=0.1 2023-06-24 16:43:38,233 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=12.0 2023-06-24 16:43:38,357 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=18.37 vs. limit=22.5 2023-06-24 16:43:39,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1140942.0, ans=0.1 2023-06-24 16:43:43,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1140942.0, ans=0.0 2023-06-24 16:43:56,427 INFO [train.py:996] (2/4) Epoch 7, batch 7200, loss[loss=0.1983, simple_loss=0.2682, pruned_loss=0.06419, over 21829.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2941, pruned_loss=0.07177, over 4270523.21 frames. ], batch size: 317, lr: 4.40e-03, grad_scale: 32.0 2023-06-24 16:44:09,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1141002.0, ans=0.125 2023-06-24 16:44:12,349 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.840e+02 3.235e+02 4.044e+02 5.731e+02, threshold=6.469e+02, percent-clipped=0.0 2023-06-24 16:44:58,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1141122.0, ans=0.07 2023-06-24 16:45:34,701 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-24 16:45:38,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1141242.0, ans=0.125 2023-06-24 16:45:38,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1141242.0, ans=0.2 2023-06-24 16:45:45,326 INFO [train.py:996] (2/4) Epoch 7, batch 7250, loss[loss=0.1959, simple_loss=0.262, pruned_loss=0.06486, over 21594.00 frames. 
], tot_loss[loss=0.2167, simple_loss=0.2894, pruned_loss=0.07204, over 4277809.92 frames. ], batch size: 298, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:45:54,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1141302.0, ans=0.0 2023-06-24 16:46:01,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1141362.0, ans=0.125 2023-06-24 16:46:10,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1141362.0, ans=0.125 2023-06-24 16:46:37,077 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=15.0 2023-06-24 16:47:13,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1141482.0, ans=0.0 2023-06-24 16:47:13,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1141482.0, ans=0.0 2023-06-24 16:47:34,237 INFO [train.py:996] (2/4) Epoch 7, batch 7300, loss[loss=0.1879, simple_loss=0.2535, pruned_loss=0.06119, over 21817.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2853, pruned_loss=0.07099, over 4269417.07 frames. ], batch size: 352, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:47:44,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1141602.0, ans=0.1 2023-06-24 16:47:51,216 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 2.579e+02 3.088e+02 3.610e+02 6.583e+02, threshold=6.177e+02, percent-clipped=0.0 2023-06-24 16:48:28,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1141722.0, ans=0.125 2023-06-24 16:49:10,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1141842.0, ans=0.0 2023-06-24 16:49:25,156 INFO [train.py:996] (2/4) Epoch 7, batch 7350, loss[loss=0.2058, simple_loss=0.2748, pruned_loss=0.06839, over 21741.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2824, pruned_loss=0.07151, over 4266068.36 frames. ], batch size: 102, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:49:57,293 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-24 16:50:50,295 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-06-24 16:51:09,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1142142.0, ans=0.0 2023-06-24 16:51:11,772 INFO [train.py:996] (2/4) Epoch 7, batch 7400, loss[loss=0.2461, simple_loss=0.3189, pruned_loss=0.08667, over 21609.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.288, pruned_loss=0.07356, over 4273802.46 frames. ], batch size: 389, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:51:13,381 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. 
limit=6.0 2023-06-24 16:51:38,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1142202.0, ans=0.0 2023-06-24 16:51:41,597 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 2.851e+02 3.315e+02 4.181e+02 6.542e+02, threshold=6.630e+02, percent-clipped=3.0 2023-06-24 16:52:19,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1142322.0, ans=0.125 2023-06-24 16:52:29,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1142382.0, ans=0.125 2023-06-24 16:52:31,790 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:52:37,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1142382.0, ans=0.125 2023-06-24 16:52:49,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1142442.0, ans=0.125 2023-06-24 16:53:02,626 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=15.0 2023-06-24 16:53:03,474 INFO [train.py:996] (2/4) Epoch 7, batch 7450, loss[loss=0.21, simple_loss=0.2804, pruned_loss=0.06982, over 21486.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2871, pruned_loss=0.07295, over 4278115.11 frames. ], batch size: 389, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:53:55,307 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.45 vs. limit=22.5 2023-06-24 16:54:32,217 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:55:06,459 INFO [train.py:996] (2/4) Epoch 7, batch 7500, loss[loss=0.2485, simple_loss=0.3582, pruned_loss=0.06938, over 21754.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2921, pruned_loss=0.07399, over 4284339.09 frames. ], batch size: 332, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:55:29,613 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.290e+02 3.030e+02 3.534e+02 4.560e+02 9.672e+02, threshold=7.067e+02, percent-clipped=6.0 2023-06-24 16:55:46,020 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-06-24 16:56:56,955 INFO [train.py:996] (2/4) Epoch 7, batch 7550, loss[loss=0.2032, simple_loss=0.2975, pruned_loss=0.05448, over 21712.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2986, pruned_loss=0.07265, over 4287194.40 frames. 
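Note: in the Clipping_scale entries above, the logged threshold tracks Clipping_scale times the median of the recent grad-norm distribution whose quartiles are printed (e.g. 2.0 × 3.534e+02 ≈ 7.067e+02 in the entry just above). A rough sketch of such an adaptive clipping threshold is given below; the window size, class name, and update logic are illustrative assumptions, not the implementation in optim.py.

    import statistics
    from collections import deque

    # Rough sketch of an adaptive clipping threshold consistent with the log:
    # threshold = clipping_scale * median of recent gradient norms.
    class AdaptiveClipper:
        def __init__(self, clipping_scale: float = 2.0, window: int = 400):
            self.clipping_scale = clipping_scale
            self.norms = deque(maxlen=window)  # recent grad norms (assumed window)

        def threshold(self) -> float:
            return self.clipping_scale * statistics.median(self.norms)

        def update(self, grad_norm: float) -> bool:
            """Record a new grad norm; return True if it would be clipped."""
            clipped = bool(self.norms) and grad_norm > self.threshold()
            self.norms.append(grad_norm)
            return clipped

On this reading, percent-clipped=6.0 in the same entry would mean roughly 6% of the recent updates exceeded that threshold.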
], batch size: 298, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:57:00,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1143102.0, ans=0.1 2023-06-24 16:57:16,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1143102.0, ans=0.125 2023-06-24 16:57:16,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1143102.0, ans=0.125 2023-06-24 16:57:56,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1143222.0, ans=0.125 2023-06-24 16:58:41,043 INFO [train.py:996] (2/4) Epoch 7, batch 7600, loss[loss=0.2128, simple_loss=0.2674, pruned_loss=0.07912, over 20207.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.298, pruned_loss=0.07174, over 4288332.65 frames. ], batch size: 702, lr: 4.39e-03, grad_scale: 32.0 2023-06-24 16:58:53,557 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.18 vs. limit=15.0 2023-06-24 16:58:59,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1143402.0, ans=0.125 2023-06-24 16:59:09,489 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.834e+02 3.229e+02 4.103e+02 6.859e+02, threshold=6.458e+02, percent-clipped=0.0 2023-06-24 16:59:16,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1143462.0, ans=0.125 2023-06-24 16:59:34,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1143522.0, ans=0.2 2023-06-24 16:59:55,946 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.46 vs. limit=15.0 2023-06-24 17:00:36,331 INFO [train.py:996] (2/4) Epoch 7, batch 7650, loss[loss=0.2381, simple_loss=0.3053, pruned_loss=0.08544, over 21878.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2964, pruned_loss=0.07309, over 4283913.33 frames. ], batch size: 371, lr: 4.39e-03, grad_scale: 32.0 2023-06-24 17:01:36,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1143822.0, ans=0.1 2023-06-24 17:01:41,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1143882.0, ans=0.1 2023-06-24 17:02:28,499 INFO [train.py:996] (2/4) Epoch 7, batch 7700, loss[loss=0.2682, simple_loss=0.3477, pruned_loss=0.09431, over 21825.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3, pruned_loss=0.07519, over 4286555.53 frames. ], batch size: 118, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:02:53,678 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 2.786e+02 3.159e+02 3.961e+02 6.423e+02, threshold=6.319e+02, percent-clipped=0.0 2023-06-24 17:04:13,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1144242.0, ans=0.0 2023-06-24 17:04:29,051 INFO [train.py:996] (2/4) Epoch 7, batch 7750, loss[loss=0.2595, simple_loss=0.3682, pruned_loss=0.07545, over 21770.00 frames. 
], tot_loss[loss=0.2274, simple_loss=0.3059, pruned_loss=0.07449, over 4281738.88 frames. ], batch size: 332, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:05:10,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1144422.0, ans=0.5 2023-06-24 17:06:27,893 INFO [train.py:996] (2/4) Epoch 7, batch 7800, loss[loss=0.2634, simple_loss=0.341, pruned_loss=0.09293, over 21461.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3107, pruned_loss=0.07579, over 4288128.85 frames. ], batch size: 471, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:06:38,160 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=15.0 2023-06-24 17:06:47,319 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.299e+02 3.311e+02 4.032e+02 5.871e+02 9.097e+02, threshold=8.064e+02, percent-clipped=12.0 2023-06-24 17:07:22,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1144722.0, ans=0.0 2023-06-24 17:07:40,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1144782.0, ans=0.125 2023-06-24 17:08:01,029 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.12 vs. limit=10.0 2023-06-24 17:08:01,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1144842.0, ans=0.125 2023-06-24 17:08:10,985 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.96 vs. limit=15.0 2023-06-24 17:08:11,748 INFO [train.py:996] (2/4) Epoch 7, batch 7850, loss[loss=0.2195, simple_loss=0.2845, pruned_loss=0.07722, over 22001.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3018, pruned_loss=0.07462, over 4279641.95 frames. ], batch size: 103, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:09:36,746 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.28 vs. limit=12.0 2023-06-24 17:09:40,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1145142.0, ans=0.125 2023-06-24 17:09:46,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1145142.0, ans=0.125 2023-06-24 17:10:05,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1145142.0, ans=0.125 2023-06-24 17:10:10,680 INFO [train.py:996] (2/4) Epoch 7, batch 7900, loss[loss=0.2428, simple_loss=0.3341, pruned_loss=0.07572, over 21600.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2975, pruned_loss=0.07429, over 4267599.96 frames. ], batch size: 389, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:10:23,085 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.23 vs. 
limit=15.0 2023-06-24 17:10:30,887 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.893e+02 3.310e+02 4.075e+02 8.177e+02, threshold=6.621e+02, percent-clipped=1.0 2023-06-24 17:10:39,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1145262.0, ans=0.125 2023-06-24 17:11:11,424 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.73 vs. limit=15.0 2023-06-24 17:11:50,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1145442.0, ans=0.0 2023-06-24 17:11:54,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1145442.0, ans=0.0 2023-06-24 17:11:54,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1145442.0, ans=0.0 2023-06-24 17:12:02,927 INFO [train.py:996] (2/4) Epoch 7, batch 7950, loss[loss=0.2972, simple_loss=0.3662, pruned_loss=0.1141, over 21534.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3002, pruned_loss=0.07341, over 4263932.89 frames. ], batch size: 507, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:13:07,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1145622.0, ans=0.0 2023-06-24 17:13:54,792 INFO [train.py:996] (2/4) Epoch 7, batch 8000, loss[loss=0.289, simple_loss=0.3708, pruned_loss=0.1036, over 21481.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3049, pruned_loss=0.07534, over 4265958.75 frames. ], batch size: 471, lr: 4.39e-03, grad_scale: 32.0 2023-06-24 17:14:22,383 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 2.777e+02 3.258e+02 3.899e+02 6.990e+02, threshold=6.515e+02, percent-clipped=3.0 2023-06-24 17:14:44,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1145862.0, ans=0.0 2023-06-24 17:15:20,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1145982.0, ans=0.1 2023-06-24 17:15:27,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1145982.0, ans=0.125 2023-06-24 17:15:57,262 INFO [train.py:996] (2/4) Epoch 7, batch 8050, loss[loss=0.1962, simple_loss=0.2676, pruned_loss=0.06243, over 21410.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3065, pruned_loss=0.07584, over 4260701.87 frames. ], batch size: 194, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:17:09,170 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.32 vs. limit=12.0 2023-06-24 17:17:48,335 INFO [train.py:996] (2/4) Epoch 7, batch 8100, loss[loss=0.1959, simple_loss=0.271, pruned_loss=0.06037, over 21152.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.307, pruned_loss=0.0767, over 4260958.92 frames. 
], batch size: 607, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:18:21,655 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 3.042e+02 3.840e+02 5.397e+02 9.623e+02, threshold=7.680e+02, percent-clipped=13.0 2023-06-24 17:18:28,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1146462.0, ans=0.125 2023-06-24 17:18:41,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1146522.0, ans=0.125 2023-06-24 17:18:48,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1146522.0, ans=0.1 2023-06-24 17:18:54,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1146582.0, ans=0.1 2023-06-24 17:18:56,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1146582.0, ans=0.125 2023-06-24 17:18:58,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1146582.0, ans=0.1 2023-06-24 17:19:52,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1146642.0, ans=0.125 2023-06-24 17:19:55,093 INFO [train.py:996] (2/4) Epoch 7, batch 8150, loss[loss=0.1994, simple_loss=0.2812, pruned_loss=0.05883, over 21527.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3112, pruned_loss=0.0771, over 4267120.60 frames. ], batch size: 212, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:20:03,563 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-06-24 17:20:20,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1146762.0, ans=0.1 2023-06-24 17:20:27,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1146762.0, ans=0.0 2023-06-24 17:20:31,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1146822.0, ans=0.125 2023-06-24 17:21:15,148 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.90 vs. limit=22.5 2023-06-24 17:21:36,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1146942.0, ans=0.0 2023-06-24 17:21:44,387 INFO [train.py:996] (2/4) Epoch 7, batch 8200, loss[loss=0.2472, simple_loss=0.2948, pruned_loss=0.09982, over 21554.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3057, pruned_loss=0.07495, over 4265604.91 frames. 
], batch size: 442, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:22:06,171 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.961e+02 3.959e+02 5.617e+02 1.113e+03, threshold=7.919e+02, percent-clipped=3.0 2023-06-24 17:22:31,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1147122.0, ans=0.125 2023-06-24 17:22:56,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1147182.0, ans=0.0 2023-06-24 17:23:28,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1147302.0, ans=0.125 2023-06-24 17:23:29,576 INFO [train.py:996] (2/4) Epoch 7, batch 8250, loss[loss=0.2074, simple_loss=0.3019, pruned_loss=0.05642, over 21428.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3039, pruned_loss=0.07382, over 4264204.11 frames. ], batch size: 211, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:23:43,115 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.18 vs. limit=10.0 2023-06-24 17:23:43,130 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=22.5 2023-06-24 17:23:59,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1147362.0, ans=0.95 2023-06-24 17:24:08,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1147422.0, ans=15.0 2023-06-24 17:24:17,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1147422.0, ans=0.125 2023-06-24 17:24:31,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1147482.0, ans=0.125 2023-06-24 17:25:16,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1147542.0, ans=0.125 2023-06-24 17:25:22,783 INFO [train.py:996] (2/4) Epoch 7, batch 8300, loss[loss=0.2011, simple_loss=0.2819, pruned_loss=0.06013, over 21266.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3031, pruned_loss=0.07171, over 4267898.53 frames. 
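Note: in the ScheduledFloat entries above, ans appears to be the current value of a scheduled hyperparameter (dropout probabilities, skip rates, balancer probs) evaluated at the given batch_count. A toy piecewise-linear schedule illustrating the idea is sketched below; the breakpoints and values are purely hypothetical and not taken from scaling.py.

    # Toy illustration of a scheduled (piecewise-linear in batch_count) float
    # hyperparameter like the "ans" values logged above. Breakpoints are
    # purely hypothetical, not taken from scaling.py.
    def scheduled_float(batch_count: float,
                        points=((0.0, 0.3), (20000.0, 0.1))) -> float:
        (x0, y0), (x1, y1) = points
        if batch_count <= x0:
            return y0
        if batch_count >= x1:
            return y1
        frac = (batch_count - x0) / (x1 - x0)
        return y0 + frac * (y1 - y0)

    # Well past the last breakpoint the value has settled at its final level:
    assert scheduled_float(1_150_000.0) == 0.1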
], batch size: 176, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:25:31,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1147602.0, ans=0.125 2023-06-24 17:25:43,624 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.710e+02 3.107e+02 3.703e+02 5.803e+02, threshold=6.215e+02, percent-clipped=0.0 2023-06-24 17:26:10,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1147722.0, ans=0.1 2023-06-24 17:26:32,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1147782.0, ans=0.2 2023-06-24 17:26:37,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1147782.0, ans=0.125 2023-06-24 17:27:03,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1147842.0, ans=0.125 2023-06-24 17:27:12,199 INFO [train.py:996] (2/4) Epoch 7, batch 8350, loss[loss=0.199, simple_loss=0.2811, pruned_loss=0.05843, over 21080.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.3015, pruned_loss=0.06986, over 4260746.68 frames. ], batch size: 143, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:27:18,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1147902.0, ans=0.125 2023-06-24 17:27:32,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1147962.0, ans=0.125 2023-06-24 17:27:39,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1147962.0, ans=0.0 2023-06-24 17:28:20,450 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.35 vs. limit=15.0 2023-06-24 17:28:53,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1148142.0, ans=0.035 2023-06-24 17:29:03,710 INFO [train.py:996] (2/4) Epoch 7, batch 8400, loss[loss=0.2356, simple_loss=0.3652, pruned_loss=0.05304, over 20798.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2982, pruned_loss=0.06727, over 4250691.22 frames. ], batch size: 607, lr: 4.39e-03, grad_scale: 32.0 2023-06-24 17:29:08,883 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. 
limit=6.0 2023-06-24 17:29:25,439 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.527e+02 3.220e+02 3.909e+02 1.035e+03, threshold=6.440e+02, percent-clipped=5.0 2023-06-24 17:29:30,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1148262.0, ans=0.1 2023-06-24 17:29:42,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1148322.0, ans=0.1 2023-06-24 17:29:58,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1148322.0, ans=0.125 2023-06-24 17:30:47,865 INFO [train.py:996] (2/4) Epoch 7, batch 8450, loss[loss=0.2349, simple_loss=0.3474, pruned_loss=0.06121, over 20906.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2974, pruned_loss=0.0672, over 4261527.32 frames. ], batch size: 607, lr: 4.39e-03, grad_scale: 32.0 2023-06-24 17:32:29,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1148742.0, ans=0.2 2023-06-24 17:32:32,037 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.56 vs. limit=15.0 2023-06-24 17:32:36,618 INFO [train.py:996] (2/4) Epoch 7, batch 8500, loss[loss=0.2131, simple_loss=0.2761, pruned_loss=0.07505, over 21721.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.294, pruned_loss=0.0691, over 4258499.05 frames. ], batch size: 351, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:32:37,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1148802.0, ans=0.0 2023-06-24 17:32:57,151 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 2.839e+02 3.413e+02 4.005e+02 7.078e+02, threshold=6.826e+02, percent-clipped=2.0 2023-06-24 17:32:58,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1148862.0, ans=0.1 2023-06-24 17:34:02,696 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 17:34:13,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1149042.0, ans=0.125 2023-06-24 17:34:22,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1149042.0, ans=0.125 2023-06-24 17:34:26,836 INFO [train.py:996] (2/4) Epoch 7, batch 8550, loss[loss=0.2645, simple_loss=0.3504, pruned_loss=0.08931, over 21673.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2984, pruned_loss=0.07208, over 4255942.37 frames. ], batch size: 441, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:34:54,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1149162.0, ans=0.125 2023-06-24 17:35:25,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1149222.0, ans=0.0 2023-06-24 17:36:18,076 INFO [train.py:996] (2/4) Epoch 7, batch 8600, loss[loss=0.2646, simple_loss=0.3555, pruned_loss=0.08683, over 21289.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3037, pruned_loss=0.07406, over 4260619.09 frames. 
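Note: the grad_scale field in the batch summaries moves between 8.0, 16.0 and 32.0 over this stretch of training. That pattern is consistent with dynamic loss scaling for mixed-precision training, where the scale is halved when a step overflows and doubled again after a run of stable steps. The sketch below illustrates such a policy; the growth interval and factors are assumptions, not the recipe's actual settings.

    # Minimal sketch of dynamic loss scaling consistent with the grad_scale
    # values in the log (doubling after stable steps, halving on overflow).
    class GradScaleSketch:
        def __init__(self, scale: float = 8.0, growth_interval: int = 2000):
            self.scale = scale
            self.growth_interval = growth_interval  # assumed, not the recipe's value
            self.stable_steps = 0

        def step(self, found_inf: bool) -> float:
            if found_inf:
                self.scale *= 0.5          # overflow: shrink the scale
                self.stable_steps = 0
            else:
                self.stable_steps += 1
                if self.stable_steps >= self.growth_interval:
                    self.scale *= 2.0      # long stable run: grow the scale
                    self.stable_steps = 0
            return self.scale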
], batch size: 548, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:36:39,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1149462.0, ans=0.1 2023-06-24 17:36:40,533 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.213e+02 3.018e+02 3.698e+02 4.926e+02 7.683e+02, threshold=7.396e+02, percent-clipped=5.0 2023-06-24 17:37:48,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1149642.0, ans=0.125 2023-06-24 17:37:58,818 INFO [train.py:996] (2/4) Epoch 7, batch 8650, loss[loss=0.169, simple_loss=0.2687, pruned_loss=0.03468, over 21776.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3088, pruned_loss=0.07439, over 4265683.86 frames. ], batch size: 282, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:38:03,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1149702.0, ans=15.0 2023-06-24 17:39:10,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1149882.0, ans=0.1 2023-06-24 17:39:10,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1149882.0, ans=0.0 2023-06-24 17:39:14,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1149882.0, ans=0.07 2023-06-24 17:39:15,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1149882.0, ans=0.125 2023-06-24 17:39:29,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1149942.0, ans=0.2 2023-06-24 17:39:42,725 INFO [train.py:996] (2/4) Epoch 7, batch 8700, loss[loss=0.2036, simple_loss=0.27, pruned_loss=0.06855, over 21455.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3022, pruned_loss=0.07166, over 4257104.19 frames. ], batch size: 389, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:40:09,733 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.692e+02 2.588e+02 3.028e+02 3.644e+02 6.697e+02, threshold=6.057e+02, percent-clipped=0.0 2023-06-24 17:41:19,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1150242.0, ans=0.025 2023-06-24 17:41:22,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1150242.0, ans=0.125 2023-06-24 17:41:22,704 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.17 vs. limit=15.0 2023-06-24 17:41:30,731 INFO [train.py:996] (2/4) Epoch 7, batch 8750, loss[loss=0.2282, simple_loss=0.2953, pruned_loss=0.08049, over 21721.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2979, pruned_loss=0.07209, over 4265445.43 frames. 
], batch size: 389, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:41:34,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1150302.0, ans=0.125 2023-06-24 17:42:08,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1150362.0, ans=0.0 2023-06-24 17:43:12,924 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-06-24 17:43:15,870 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 17:43:22,456 INFO [train.py:996] (2/4) Epoch 7, batch 8800, loss[loss=0.2719, simple_loss=0.3491, pruned_loss=0.09738, over 21826.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3058, pruned_loss=0.07462, over 4269986.57 frames. ], batch size: 124, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:43:36,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1150602.0, ans=0.1 2023-06-24 17:44:02,362 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 3.059e+02 3.780e+02 4.742e+02 8.855e+02, threshold=7.560e+02, percent-clipped=10.0 2023-06-24 17:44:06,966 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.27 vs. limit=12.0 2023-06-24 17:45:15,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1150842.0, ans=0.0 2023-06-24 17:45:24,899 INFO [train.py:996] (2/4) Epoch 7, batch 8850, loss[loss=0.2025, simple_loss=0.2939, pruned_loss=0.05548, over 21669.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3104, pruned_loss=0.07597, over 4271031.46 frames. ], batch size: 247, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:45:25,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1150902.0, ans=0.125 2023-06-24 17:45:32,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1150902.0, ans=0.2 2023-06-24 17:46:04,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1150962.0, ans=0.125 2023-06-24 17:47:16,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1151202.0, ans=0.1 2023-06-24 17:47:16,907 INFO [train.py:996] (2/4) Epoch 7, batch 8900, loss[loss=0.2037, simple_loss=0.285, pruned_loss=0.06121, over 21624.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3046, pruned_loss=0.07534, over 4268445.40 frames. ], batch size: 247, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:47:36,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1151202.0, ans=0.2 2023-06-24 17:47:36,959 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.92 vs. 
limit=15.0 2023-06-24 17:47:42,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1151202.0, ans=0.125 2023-06-24 17:47:47,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1151262.0, ans=0.1 2023-06-24 17:47:54,310 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 2.946e+02 3.604e+02 5.046e+02 1.118e+03, threshold=7.207e+02, percent-clipped=3.0 2023-06-24 17:48:04,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1151322.0, ans=0.125 2023-06-24 17:48:06,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1151322.0, ans=0.2 2023-06-24 17:48:50,902 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.40 vs. limit=15.0 2023-06-24 17:49:21,206 INFO [train.py:996] (2/4) Epoch 7, batch 8950, loss[loss=0.2269, simple_loss=0.3034, pruned_loss=0.07523, over 21890.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3049, pruned_loss=0.07486, over 4273672.91 frames. ], batch size: 372, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:49:34,685 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.81 vs. limit=15.0 2023-06-24 17:49:49,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1151562.0, ans=0.0 2023-06-24 17:49:52,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1151562.0, ans=0.015 2023-06-24 17:50:35,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1151682.0, ans=0.1 2023-06-24 17:51:10,368 INFO [train.py:996] (2/4) Epoch 7, batch 9000, loss[loss=0.2171, simple_loss=0.2826, pruned_loss=0.07578, over 22002.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2996, pruned_loss=0.07443, over 4275047.74 frames. ], batch size: 103, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:51:10,369 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 17:51:28,289 INFO [train.py:1028] (2/4) Epoch 7, validation: loss=0.2657, simple_loss=0.3576, pruned_loss=0.0869, over 1796401.00 frames. 2023-06-24 17:51:28,290 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-24 17:51:53,809 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.127e+02 2.929e+02 3.694e+02 4.955e+02 7.799e+02, threshold=7.388e+02, percent-clipped=3.0 2023-06-24 17:52:32,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1151982.0, ans=0.0 2023-06-24 17:53:21,715 INFO [train.py:996] (2/4) Epoch 7, batch 9050, loss[loss=0.2657, simple_loss=0.3353, pruned_loss=0.09803, over 21727.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2967, pruned_loss=0.07213, over 4273264.65 frames. 
], batch size: 441, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:53:40,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1152162.0, ans=0.125 2023-06-24 17:53:53,879 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.31 vs. limit=15.0 2023-06-24 17:54:19,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1152222.0, ans=0.04949747468305833 2023-06-24 17:54:22,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1152222.0, ans=0.0 2023-06-24 17:55:14,793 INFO [train.py:996] (2/4) Epoch 7, batch 9100, loss[loss=0.2839, simple_loss=0.3512, pruned_loss=0.1083, over 21441.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3035, pruned_loss=0.07444, over 4271280.81 frames. ], batch size: 471, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:55:18,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1152402.0, ans=0.0 2023-06-24 17:55:22,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1152402.0, ans=0.2 2023-06-24 17:55:40,424 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 17:55:45,283 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 2.655e+02 3.193e+02 3.861e+02 6.275e+02, threshold=6.386e+02, percent-clipped=0.0 2023-06-24 17:56:18,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1152582.0, ans=0.1 2023-06-24 17:56:31,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1152582.0, ans=0.035 2023-06-24 17:56:38,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1152582.0, ans=0.0 2023-06-24 17:56:45,673 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 17:57:01,011 INFO [train.py:996] (2/4) Epoch 7, batch 9150, loss[loss=0.2438, simple_loss=0.3344, pruned_loss=0.07657, over 21792.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3065, pruned_loss=0.07246, over 4268010.14 frames. ], batch size: 351, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:57:51,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1152822.0, ans=0.1 2023-06-24 17:57:52,279 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.40 vs. limit=15.0 2023-06-24 17:57:57,031 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0 2023-06-24 17:58:27,489 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.13 vs. 
limit=22.5 2023-06-24 17:58:46,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1152942.0, ans=0.04949747468305833 2023-06-24 17:58:58,734 INFO [train.py:996] (2/4) Epoch 7, batch 9200, loss[loss=0.2492, simple_loss=0.3339, pruned_loss=0.08225, over 21709.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3098, pruned_loss=0.07192, over 4272338.93 frames. ], batch size: 351, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:59:19,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1153062.0, ans=0.09899494936611666 2023-06-24 17:59:29,540 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.106e+02 2.740e+02 3.426e+02 4.320e+02 8.569e+02, threshold=6.853e+02, percent-clipped=6.0 2023-06-24 18:00:01,999 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.81 vs. limit=12.0 2023-06-24 18:00:09,465 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-24 18:00:14,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1153182.0, ans=0.1 2023-06-24 18:00:24,134 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.02 vs. limit=5.0 2023-06-24 18:00:50,677 INFO [train.py:996] (2/4) Epoch 7, batch 9250, loss[loss=0.2145, simple_loss=0.2868, pruned_loss=0.07109, over 21758.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3114, pruned_loss=0.07415, over 4266665.93 frames. ], batch size: 282, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 18:01:29,714 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.69 vs. limit=15.0 2023-06-24 18:01:31,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1153362.0, ans=0.0 2023-06-24 18:01:32,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1153362.0, ans=0.125 2023-06-24 18:01:34,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1153362.0, ans=0.5 2023-06-24 18:01:36,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1153422.0, ans=0.125 2023-06-24 18:01:38,912 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.35 vs. limit=15.0 2023-06-24 18:01:48,090 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.13 vs. 
limit=15.0 2023-06-24 18:01:51,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1153422.0, ans=0.125 2023-06-24 18:01:59,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1153482.0, ans=0.1 2023-06-24 18:02:37,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1153542.0, ans=0.125 2023-06-24 18:02:42,906 INFO [train.py:996] (2/4) Epoch 7, batch 9300, loss[loss=0.2335, simple_loss=0.2997, pruned_loss=0.0837, over 19998.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3051, pruned_loss=0.07382, over 4268559.95 frames. ], batch size: 702, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 18:03:13,912 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 3.058e+02 3.549e+02 4.364e+02 7.419e+02, threshold=7.098e+02, percent-clipped=2.0 2023-06-24 18:03:23,890 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.02 vs. limit=22.5 2023-06-24 18:03:52,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1153782.0, ans=0.2 2023-06-24 18:04:05,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1153782.0, ans=0.125 2023-06-24 18:04:13,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1153842.0, ans=0.05 2023-06-24 18:04:29,066 INFO [train.py:996] (2/4) Epoch 7, batch 9350, loss[loss=0.214, simple_loss=0.3303, pruned_loss=0.0489, over 20718.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3109, pruned_loss=0.07416, over 4261402.29 frames. ], batch size: 607, lr: 4.37e-03, grad_scale: 32.0 2023-06-24 18:05:22,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1154022.0, ans=0.2 2023-06-24 18:05:57,118 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.26 vs. limit=10.0 2023-06-24 18:06:06,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1154142.0, ans=0.0 2023-06-24 18:06:31,747 INFO [train.py:996] (2/4) Epoch 7, batch 9400, loss[loss=0.2024, simple_loss=0.2629, pruned_loss=0.07096, over 21365.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3115, pruned_loss=0.07436, over 4270520.44 frames. ], batch size: 194, lr: 4.37e-03, grad_scale: 32.0 2023-06-24 18:06:50,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1154202.0, ans=0.1 2023-06-24 18:07:02,161 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.873e+02 3.280e+02 3.858e+02 8.681e+02, threshold=6.561e+02, percent-clipped=2.0 2023-06-24 18:07:04,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1154262.0, ans=0.125 2023-06-24 18:07:21,035 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.55 vs. 
limit=12.0 2023-06-24 18:08:21,776 INFO [train.py:996] (2/4) Epoch 7, batch 9450, loss[loss=0.1861, simple_loss=0.2408, pruned_loss=0.06574, over 21354.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3037, pruned_loss=0.07344, over 4274153.88 frames. ], batch size: 551, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:08:22,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1154502.0, ans=0.1 2023-06-24 18:08:35,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1154502.0, ans=0.125 2023-06-24 18:08:46,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1154562.0, ans=0.0 2023-06-24 18:08:56,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1154562.0, ans=0.025 2023-06-24 18:09:35,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1154682.0, ans=0.125 2023-06-24 18:10:10,153 INFO [train.py:996] (2/4) Epoch 7, batch 9500, loss[loss=0.1851, simple_loss=0.2715, pruned_loss=0.04933, over 21728.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2946, pruned_loss=0.07119, over 4274181.29 frames. ], batch size: 247, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:10:42,232 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.886e+02 3.476e+02 4.165e+02 8.781e+02, threshold=6.953e+02, percent-clipped=4.0 2023-06-24 18:11:13,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=1154982.0, ans=0.2 2023-06-24 18:11:24,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1154982.0, ans=0.025 2023-06-24 18:11:46,516 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=22.5 2023-06-24 18:12:01,005 INFO [train.py:996] (2/4) Epoch 7, batch 9550, loss[loss=0.2672, simple_loss=0.3303, pruned_loss=0.1021, over 21392.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3013, pruned_loss=0.0732, over 4275254.85 frames. ], batch size: 549, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:12:56,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1155222.0, ans=0.125 2023-06-24 18:13:50,415 INFO [train.py:996] (2/4) Epoch 7, batch 9600, loss[loss=0.2223, simple_loss=0.3217, pruned_loss=0.06147, over 20777.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3044, pruned_loss=0.07528, over 4280687.20 frames. 
], batch size: 607, lr: 4.37e-03, grad_scale: 32.0 2023-06-24 18:13:54,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1155402.0, ans=0.125 2023-06-24 18:14:23,105 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 3.053e+02 3.563e+02 4.666e+02 8.626e+02, threshold=7.126e+02, percent-clipped=5.0 2023-06-24 18:14:29,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1155522.0, ans=0.1 2023-06-24 18:14:29,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1155522.0, ans=0.1 2023-06-24 18:15:45,077 INFO [train.py:996] (2/4) Epoch 7, batch 9650, loss[loss=0.2726, simple_loss=0.3512, pruned_loss=0.09699, over 21484.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3045, pruned_loss=0.07592, over 4285764.64 frames. ], batch size: 131, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:16:12,600 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.29 vs. limit=12.0 2023-06-24 18:17:04,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1155882.0, ans=0.1 2023-06-24 18:17:04,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1155882.0, ans=0.0 2023-06-24 18:17:28,679 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.16 vs. limit=10.0 2023-06-24 18:17:31,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1155942.0, ans=0.1 2023-06-24 18:17:34,804 INFO [train.py:996] (2/4) Epoch 7, batch 9700, loss[loss=0.207, simple_loss=0.2819, pruned_loss=0.06606, over 21360.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3075, pruned_loss=0.07631, over 4288627.19 frames. ], batch size: 159, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:17:37,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1156002.0, ans=0.125 2023-06-24 18:17:52,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1156062.0, ans=0.125 2023-06-24 18:18:08,283 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.706e+02 3.025e+02 3.673e+02 7.479e+02, threshold=6.049e+02, percent-clipped=1.0 2023-06-24 18:18:20,493 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.36 vs. limit=8.0 2023-06-24 18:19:11,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1156242.0, ans=0.1 2023-06-24 18:19:18,095 INFO [train.py:996] (2/4) Epoch 7, batch 9750, loss[loss=0.22, simple_loss=0.2683, pruned_loss=0.08582, over 21464.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3013, pruned_loss=0.07534, over 4289142.43 frames. 
], batch size: 476, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:19:41,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1156362.0, ans=0.125 2023-06-24 18:19:41,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1156362.0, ans=0.2 2023-06-24 18:19:59,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1156422.0, ans=0.125 2023-06-24 18:20:15,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1156482.0, ans=0.5 2023-06-24 18:20:15,789 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-24 18:21:07,437 INFO [train.py:996] (2/4) Epoch 7, batch 9800, loss[loss=0.2123, simple_loss=0.2806, pruned_loss=0.072, over 20091.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3007, pruned_loss=0.07488, over 4271964.65 frames. ], batch size: 703, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:21:25,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1156662.0, ans=0.125 2023-06-24 18:21:30,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1156662.0, ans=0.1 2023-06-24 18:21:32,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1156662.0, ans=0.1 2023-06-24 18:21:39,994 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.762e+02 3.059e+02 4.077e+02 6.018e+02, threshold=6.118e+02, percent-clipped=0.0 2023-06-24 18:22:18,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1156782.0, ans=0.125 2023-06-24 18:22:39,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1156842.0, ans=0.1 2023-06-24 18:22:54,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1156902.0, ans=0.035 2023-06-24 18:22:55,802 INFO [train.py:996] (2/4) Epoch 7, batch 9850, loss[loss=0.2036, simple_loss=0.2767, pruned_loss=0.06527, over 21889.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2973, pruned_loss=0.07484, over 4263489.67 frames. ], batch size: 333, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:23:11,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1156962.0, ans=0.125 2023-06-24 18:23:27,208 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.09 vs. limit=15.0 2023-06-24 18:23:45,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1157022.0, ans=0.1 2023-06-24 18:23:56,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1157082.0, ans=0.0 2023-06-24 18:24:38,515 INFO [train.py:996] (2/4) Epoch 7, batch 9900, loss[loss=0.2371, simple_loss=0.3064, pruned_loss=0.08384, over 21424.00 frames. 
], tot_loss[loss=0.2211, simple_loss=0.2932, pruned_loss=0.07446, over 4261184.88 frames. ], batch size: 211, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:25:12,376 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 2.791e+02 3.369e+02 4.122e+02 6.726e+02, threshold=6.739e+02, percent-clipped=1.0 2023-06-24 18:25:38,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1157382.0, ans=0.125 2023-06-24 18:25:39,235 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.26 vs. limit=15.0 2023-06-24 18:26:27,529 INFO [train.py:996] (2/4) Epoch 7, batch 9950, loss[loss=0.1973, simple_loss=0.2629, pruned_loss=0.0658, over 21565.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2943, pruned_loss=0.07622, over 4263535.33 frames. ], batch size: 263, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:26:27,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1157502.0, ans=0.0 2023-06-24 18:26:43,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1157562.0, ans=0.0 2023-06-24 18:27:33,648 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=12.0 2023-06-24 18:28:16,564 INFO [train.py:996] (2/4) Epoch 7, batch 10000, loss[loss=0.2318, simple_loss=0.3059, pruned_loss=0.07882, over 21367.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2896, pruned_loss=0.07444, over 4270588.90 frames. ], batch size: 549, lr: 4.37e-03, grad_scale: 32.0 2023-06-24 18:28:26,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1157802.0, ans=0.125 2023-06-24 18:28:45,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1157862.0, ans=0.125 2023-06-24 18:28:49,777 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 2.643e+02 3.254e+02 4.440e+02 7.063e+02, threshold=6.507e+02, percent-clipped=1.0 2023-06-24 18:29:50,285 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.50 vs. limit=15.0 2023-06-24 18:30:04,092 INFO [train.py:996] (2/4) Epoch 7, batch 10050, loss[loss=0.297, simple_loss=0.3518, pruned_loss=0.121, over 21436.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2938, pruned_loss=0.07614, over 4270282.81 frames. ], batch size: 509, lr: 4.37e-03, grad_scale: 32.0 2023-06-24 18:30:50,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1158222.0, ans=0.07 2023-06-24 18:31:47,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1158342.0, ans=0.125 2023-06-24 18:32:01,215 INFO [train.py:996] (2/4) Epoch 7, batch 10100, loss[loss=0.1958, simple_loss=0.2762, pruned_loss=0.05767, over 20881.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2906, pruned_loss=0.07347, over 4276358.20 frames. 
], batch size: 608, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:32:30,798 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.650e+02 3.073e+02 3.822e+02 6.259e+02, threshold=6.145e+02, percent-clipped=0.0 2023-06-24 18:32:54,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1158522.0, ans=0.1 2023-06-24 18:33:15,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1158582.0, ans=0.1 2023-06-24 18:33:48,278 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.46 vs. limit=10.0 2023-06-24 18:33:50,312 INFO [train.py:996] (2/4) Epoch 7, batch 10150, loss[loss=0.222, simple_loss=0.3078, pruned_loss=0.06812, over 21805.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2969, pruned_loss=0.07533, over 4265773.47 frames. ], batch size: 282, lr: 4.37e-03, grad_scale: 8.0 2023-06-24 18:34:00,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1158702.0, ans=0.1 2023-06-24 18:34:19,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1158762.0, ans=0.125 2023-06-24 18:34:48,845 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.66 vs. limit=22.5 2023-06-24 18:34:55,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1158822.0, ans=0.0 2023-06-24 18:35:21,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1158942.0, ans=0.125 2023-06-24 18:35:39,201 INFO [train.py:996] (2/4) Epoch 7, batch 10200, loss[loss=0.2181, simple_loss=0.3055, pruned_loss=0.06541, over 21705.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2948, pruned_loss=0.07293, over 4271067.40 frames. ], batch size: 415, lr: 4.37e-03, grad_scale: 8.0 2023-06-24 18:35:58,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1159062.0, ans=10.0 2023-06-24 18:36:17,255 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.114e+02 2.567e+02 2.979e+02 3.564e+02 7.472e+02, threshold=5.959e+02, percent-clipped=1.0 2023-06-24 18:36:21,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1159122.0, ans=0.0 2023-06-24 18:36:37,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1159122.0, ans=0.035 2023-06-24 18:36:47,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1159182.0, ans=0.125 2023-06-24 18:37:28,864 INFO [train.py:996] (2/4) Epoch 7, batch 10250, loss[loss=0.196, simple_loss=0.2864, pruned_loss=0.05278, over 21865.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2903, pruned_loss=0.06797, over 4271441.12 frames. 
], batch size: 372, lr: 4.36e-03, grad_scale: 8.0 2023-06-24 18:37:54,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1159362.0, ans=0.125 2023-06-24 18:38:17,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1159422.0, ans=0.125 2023-06-24 18:38:38,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1159482.0, ans=0.1 2023-06-24 18:38:44,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1159482.0, ans=0.0 2023-06-24 18:38:46,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1159482.0, ans=0.0 2023-06-24 18:39:22,124 INFO [train.py:996] (2/4) Epoch 7, batch 10300, loss[loss=0.2298, simple_loss=0.2995, pruned_loss=0.08006, over 19983.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2944, pruned_loss=0.07003, over 4271462.99 frames. ], batch size: 703, lr: 4.36e-03, grad_scale: 8.0 2023-06-24 18:39:26,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1159602.0, ans=0.0 2023-06-24 18:39:31,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1159602.0, ans=0.0 2023-06-24 18:40:11,434 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.687e+02 3.369e+02 4.671e+02 1.084e+03, threshold=6.737e+02, percent-clipped=9.0 2023-06-24 18:40:17,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1159722.0, ans=0.1 2023-06-24 18:41:14,519 INFO [train.py:996] (2/4) Epoch 7, batch 10350, loss[loss=0.1722, simple_loss=0.2188, pruned_loss=0.06283, over 21243.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2945, pruned_loss=0.06941, over 4272122.02 frames. ], batch size: 143, lr: 4.36e-03, grad_scale: 8.0 2023-06-24 18:41:29,158 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.21 vs. limit=22.5 2023-06-24 18:41:30,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1159902.0, ans=0.125 2023-06-24 18:41:59,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1159962.0, ans=0.125 2023-06-24 18:42:06,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1160022.0, ans=0.125 2023-06-24 18:42:10,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1160022.0, ans=0.125 2023-06-24 18:42:12,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1160022.0, ans=0.0 2023-06-24 18:42:54,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1160142.0, ans=0.1 2023-06-24 18:43:12,832 INFO [train.py:996] (2/4) Epoch 7, batch 10400, loss[loss=0.1748, simple_loss=0.2285, pruned_loss=0.06059, over 21296.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2885, pruned_loss=0.0689, over 4262524.90 frames. 
], batch size: 176, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:43:25,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1160202.0, ans=0.125 2023-06-24 18:43:36,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1160202.0, ans=0.125 2023-06-24 18:43:45,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1160262.0, ans=0.2 2023-06-24 18:43:47,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1160262.0, ans=0.125 2023-06-24 18:43:56,281 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.199e+02 2.812e+02 3.590e+02 4.501e+02 9.958e+02, threshold=7.181e+02, percent-clipped=6.0 2023-06-24 18:44:15,344 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-24 18:44:20,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1160322.0, ans=0.1 2023-06-24 18:44:44,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1160442.0, ans=0.0 2023-06-24 18:45:15,896 INFO [train.py:996] (2/4) Epoch 7, batch 10450, loss[loss=0.2203, simple_loss=0.3215, pruned_loss=0.05959, over 20826.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2939, pruned_loss=0.07173, over 4265958.18 frames. ], batch size: 608, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:45:21,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1160502.0, ans=0.125 2023-06-24 18:47:06,308 INFO [train.py:996] (2/4) Epoch 7, batch 10500, loss[loss=0.2012, simple_loss=0.2653, pruned_loss=0.06857, over 21658.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2944, pruned_loss=0.07075, over 4259321.88 frames. ], batch size: 298, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:47:29,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1160862.0, ans=0.125 2023-06-24 18:47:43,179 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.810e+02 3.423e+02 4.183e+02 6.636e+02, threshold=6.845e+02, percent-clipped=0.0 2023-06-24 18:48:46,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1161042.0, ans=0.07 2023-06-24 18:48:48,629 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:48:54,920 INFO [train.py:996] (2/4) Epoch 7, batch 10550, loss[loss=0.2405, simple_loss=0.2807, pruned_loss=0.1002, over 21307.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2887, pruned_loss=0.0701, over 4268869.80 frames. ], batch size: 507, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:49:11,745 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.82 vs. 
limit=15.0 2023-06-24 18:49:28,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1161222.0, ans=0.05 2023-06-24 18:49:56,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1161282.0, ans=0.1 2023-06-24 18:50:25,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1161342.0, ans=0.0 2023-06-24 18:50:40,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1161342.0, ans=0.0 2023-06-24 18:50:42,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1161342.0, ans=0.125 2023-06-24 18:50:46,825 INFO [train.py:996] (2/4) Epoch 7, batch 10600, loss[loss=0.1832, simple_loss=0.2509, pruned_loss=0.0578, over 15371.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2836, pruned_loss=0.06843, over 4264183.69 frames. ], batch size: 62, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:50:56,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1161402.0, ans=0.125 2023-06-24 18:51:00,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1161402.0, ans=0.125 2023-06-24 18:51:24,993 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 2.607e+02 2.934e+02 3.561e+02 5.999e+02, threshold=5.868e+02, percent-clipped=0.0 2023-06-24 18:51:40,743 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-24 18:52:00,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1161582.0, ans=0.0 2023-06-24 18:52:38,835 INFO [train.py:996] (2/4) Epoch 7, batch 10650, loss[loss=0.1944, simple_loss=0.3142, pruned_loss=0.03727, over 19979.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2877, pruned_loss=0.06764, over 4270905.26 frames. ], batch size: 703, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:53:52,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1161882.0, ans=0.0 2023-06-24 18:54:29,888 INFO [train.py:996] (2/4) Epoch 7, batch 10700, loss[loss=0.2568, simple_loss=0.3348, pruned_loss=0.08941, over 21404.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2867, pruned_loss=0.06704, over 4261453.07 frames. ], batch size: 131, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:55:08,597 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 2.935e+02 3.419e+02 4.511e+02 9.695e+02, threshold=6.839e+02, percent-clipped=12.0 2023-06-24 18:55:10,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1162122.0, ans=0.2 2023-06-24 18:55:14,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1162122.0, ans=0.0 2023-06-24 18:56:01,605 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.02 vs. 
limit=10.0 2023-06-24 18:56:05,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1162242.0, ans=0.0 2023-06-24 18:56:27,815 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.91 vs. limit=10.0 2023-06-24 18:56:29,694 INFO [train.py:996] (2/4) Epoch 7, batch 10750, loss[loss=0.2458, simple_loss=0.3433, pruned_loss=0.07415, over 21762.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2973, pruned_loss=0.07128, over 4264479.10 frames. ], batch size: 332, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:56:30,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1162302.0, ans=0.125 2023-06-24 18:56:36,604 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.94 vs. limit=6.0 2023-06-24 18:57:18,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1162422.0, ans=0.04949747468305833 2023-06-24 18:57:18,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1162422.0, ans=0.125 2023-06-24 18:57:57,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1162542.0, ans=0.1 2023-06-24 18:58:20,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1162602.0, ans=0.035 2023-06-24 18:58:21,623 INFO [train.py:996] (2/4) Epoch 7, batch 10800, loss[loss=0.2345, simple_loss=0.311, pruned_loss=0.07901, over 21561.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3006, pruned_loss=0.07166, over 4263437.75 frames. ], batch size: 230, lr: 4.36e-03, grad_scale: 32.0 2023-06-24 18:58:26,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1162602.0, ans=0.125 2023-06-24 18:58:53,761 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.73 vs. limit=15.0 2023-06-24 18:59:06,340 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.283e+02 2.815e+02 3.156e+02 3.825e+02 7.344e+02, threshold=6.312e+02, percent-clipped=1.0 2023-06-24 19:00:07,159 INFO [train.py:996] (2/4) Epoch 7, batch 10850, loss[loss=0.2113, simple_loss=0.2743, pruned_loss=0.07417, over 21136.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3012, pruned_loss=0.07244, over 4258629.31 frames. ], batch size: 143, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 19:00:10,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1162902.0, ans=0.0 2023-06-24 19:00:44,135 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.76 vs. limit=22.5 2023-06-24 19:01:00,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1163022.0, ans=0.1 2023-06-24 19:02:04,082 INFO [train.py:996] (2/4) Epoch 7, batch 10900, loss[loss=0.2085, simple_loss=0.3256, pruned_loss=0.04567, over 21198.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2941, pruned_loss=0.07055, over 4258897.95 frames. 
], batch size: 548, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 19:02:08,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1163202.0, ans=10.0 2023-06-24 19:02:39,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1163262.0, ans=0.125 2023-06-24 19:02:43,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1163262.0, ans=0.125 2023-06-24 19:02:47,855 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.711e+02 3.083e+02 3.861e+02 1.043e+03, threshold=6.166e+02, percent-clipped=5.0 2023-06-24 19:02:59,650 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=12.0 2023-06-24 19:03:22,813 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.93 vs. limit=15.0 2023-06-24 19:03:53,385 INFO [train.py:996] (2/4) Epoch 7, batch 10950, loss[loss=0.1788, simple_loss=0.2492, pruned_loss=0.05427, over 21229.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2901, pruned_loss=0.06885, over 4266493.47 frames. ], batch size: 144, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 19:04:09,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1163502.0, ans=0.125 2023-06-24 19:04:13,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1163502.0, ans=0.025 2023-06-24 19:04:32,063 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.71 vs. limit=15.0 2023-06-24 19:05:10,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1163682.0, ans=0.2 2023-06-24 19:05:42,606 INFO [train.py:996] (2/4) Epoch 7, batch 11000, loss[loss=0.266, simple_loss=0.3131, pruned_loss=0.1095, over 21772.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2891, pruned_loss=0.06938, over 4256917.40 frames. ], batch size: 508, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 19:05:43,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1163802.0, ans=0.125 2023-06-24 19:06:06,823 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=22.5 2023-06-24 19:06:25,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1163862.0, ans=0.09899494936611666 2023-06-24 19:06:26,189 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 2.764e+02 3.110e+02 3.886e+02 6.584e+02, threshold=6.221e+02, percent-clipped=1.0 2023-06-24 19:07:25,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1164042.0, ans=0.125 2023-06-24 19:07:31,776 INFO [train.py:996] (2/4) Epoch 7, batch 11050, loss[loss=0.1965, simple_loss=0.2487, pruned_loss=0.07218, over 21283.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2866, pruned_loss=0.07048, over 4259821.72 frames. 
], batch size: 548, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 19:08:08,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1164162.0, ans=0.95 2023-06-24 19:08:31,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1164282.0, ans=0.0 2023-06-24 19:09:17,961 INFO [train.py:996] (2/4) Epoch 7, batch 11100, loss[loss=0.2003, simple_loss=0.2691, pruned_loss=0.06579, over 21800.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2861, pruned_loss=0.07076, over 4245108.14 frames. ], batch size: 112, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 19:09:31,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1164402.0, ans=0.0 2023-06-24 19:10:00,567 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 2.678e+02 3.103e+02 3.561e+02 5.692e+02, threshold=6.205e+02, percent-clipped=0.0 2023-06-24 19:10:02,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1164522.0, ans=0.1 2023-06-24 19:10:45,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1164642.0, ans=0.0 2023-06-24 19:11:05,146 INFO [train.py:996] (2/4) Epoch 7, batch 11150, loss[loss=0.2266, simple_loss=0.309, pruned_loss=0.07208, over 21585.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2846, pruned_loss=0.07034, over 4239850.02 frames. ], batch size: 414, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:12:52,253 INFO [train.py:996] (2/4) Epoch 7, batch 11200, loss[loss=0.2258, simple_loss=0.2795, pruned_loss=0.08604, over 21217.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2824, pruned_loss=0.06978, over 4227833.76 frames. ], batch size: 471, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:13:09,203 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=12.0 2023-06-24 19:13:27,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1165062.0, ans=0.125 2023-06-24 19:13:29,848 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0 2023-06-24 19:13:35,981 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.061e+02 2.570e+02 2.865e+02 3.266e+02 5.455e+02, threshold=5.730e+02, percent-clipped=0.0 2023-06-24 19:13:59,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1165182.0, ans=0.0 2023-06-24 19:14:00,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1165182.0, ans=0.125 2023-06-24 19:14:16,598 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.57 vs. 
limit=15.0 2023-06-24 19:14:24,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1165242.0, ans=0.0 2023-06-24 19:14:25,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1165242.0, ans=0.125 2023-06-24 19:14:41,030 INFO [train.py:996] (2/4) Epoch 7, batch 11250, loss[loss=0.1986, simple_loss=0.2922, pruned_loss=0.05253, over 21565.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.282, pruned_loss=0.07025, over 4241254.95 frames. ], batch size: 195, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:14:46,000 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=12.0 2023-06-24 19:15:24,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1165362.0, ans=0.125 2023-06-24 19:15:37,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1165422.0, ans=0.125 2023-06-24 19:15:52,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1165482.0, ans=0.0 2023-06-24 19:16:02,181 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=22.5 2023-06-24 19:16:05,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1165482.0, ans=0.0 2023-06-24 19:16:05,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1165482.0, ans=0.125 2023-06-24 19:16:19,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1165542.0, ans=0.125 2023-06-24 19:16:31,015 INFO [train.py:996] (2/4) Epoch 7, batch 11300, loss[loss=0.2237, simple_loss=0.3021, pruned_loss=0.0726, over 21846.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2841, pruned_loss=0.07019, over 4253107.84 frames. ], batch size: 98, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:17:13,944 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.273e+02 2.821e+02 3.305e+02 4.579e+02 7.835e+02, threshold=6.611e+02, percent-clipped=6.0 2023-06-24 19:17:23,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1165722.0, ans=0.125 2023-06-24 19:17:34,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1165782.0, ans=0.0 2023-06-24 19:17:53,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1165782.0, ans=0.2 2023-06-24 19:18:19,904 INFO [train.py:996] (2/4) Epoch 7, batch 11350, loss[loss=0.244, simple_loss=0.3276, pruned_loss=0.0802, over 21619.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2869, pruned_loss=0.06972, over 4262801.24 frames. ], batch size: 389, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:19:09,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1166022.0, ans=0.0 2023-06-24 19:20:11,170 INFO [train.py:996] (2/4) Epoch 7, batch 11400, loss[loss=0.2166, simple_loss=0.301, pruned_loss=0.06605, over 21831.00 frames. 
], tot_loss[loss=0.2183, simple_loss=0.2922, pruned_loss=0.07224, over 4264320.96 frames. ], batch size: 282, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:20:56,089 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.109e+02 2.882e+02 3.810e+02 4.991e+02 7.494e+02, threshold=7.619e+02, percent-clipped=6.0 2023-06-24 19:21:16,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1166322.0, ans=0.1 2023-06-24 19:22:01,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1166442.0, ans=0.0 2023-06-24 19:22:06,426 INFO [train.py:996] (2/4) Epoch 7, batch 11450, loss[loss=0.2445, simple_loss=0.3408, pruned_loss=0.07413, over 21230.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.293, pruned_loss=0.07159, over 4269839.06 frames. ], batch size: 549, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:22:40,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1166562.0, ans=0.125 2023-06-24 19:23:02,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1166622.0, ans=0.125 2023-06-24 19:23:59,147 INFO [train.py:996] (2/4) Epoch 7, batch 11500, loss[loss=0.1885, simple_loss=0.2413, pruned_loss=0.06787, over 20784.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2958, pruned_loss=0.0727, over 4263279.80 frames. ], batch size: 608, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:24:44,988 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 2.827e+02 3.371e+02 4.045e+02 6.932e+02, threshold=6.743e+02, percent-clipped=0.0 2023-06-24 19:25:36,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1167042.0, ans=0.0 2023-06-24 19:25:39,617 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=15.0 2023-06-24 19:25:56,895 INFO [train.py:996] (2/4) Epoch 7, batch 11550, loss[loss=0.2619, simple_loss=0.3673, pruned_loss=0.07824, over 21249.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3029, pruned_loss=0.07371, over 4263582.86 frames. ], batch size: 548, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:26:17,798 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 19:26:21,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1167162.0, ans=0.015 2023-06-24 19:27:08,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1167282.0, ans=0.0 2023-06-24 19:27:09,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1167282.0, ans=0.125 2023-06-24 19:27:48,872 INFO [train.py:996] (2/4) Epoch 7, batch 11600, loss[loss=0.2864, simple_loss=0.3887, pruned_loss=0.09209, over 21636.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3166, pruned_loss=0.07532, over 4262483.80 frames. 
], batch size: 441, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:28:34,723 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.256e+02 2.839e+02 3.611e+02 4.809e+02 8.575e+02, threshold=7.221e+02, percent-clipped=4.0 2023-06-24 19:28:51,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1167522.0, ans=0.0 2023-06-24 19:28:59,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1167582.0, ans=0.125 2023-06-24 19:29:20,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1167642.0, ans=0.1 2023-06-24 19:29:42,836 INFO [train.py:996] (2/4) Epoch 7, batch 11650, loss[loss=0.3524, simple_loss=0.4169, pruned_loss=0.1439, over 21437.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3226, pruned_loss=0.07603, over 4261841.87 frames. ], batch size: 507, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:30:19,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1167762.0, ans=0.2 2023-06-24 19:30:37,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1167822.0, ans=0.05 2023-06-24 19:30:47,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1167882.0, ans=0.0 2023-06-24 19:31:16,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1167942.0, ans=0.125 2023-06-24 19:31:33,866 INFO [train.py:996] (2/4) Epoch 7, batch 11700, loss[loss=0.1996, simple_loss=0.2757, pruned_loss=0.06179, over 21725.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3144, pruned_loss=0.07553, over 4270263.77 frames. ], batch size: 112, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:32:07,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1168062.0, ans=0.125 2023-06-24 19:32:16,018 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 2.666e+02 3.050e+02 3.571e+02 8.433e+02, threshold=6.100e+02, percent-clipped=2.0 2023-06-24 19:33:21,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1168302.0, ans=0.2 2023-06-24 19:33:22,091 INFO [train.py:996] (2/4) Epoch 7, batch 11750, loss[loss=0.213, simple_loss=0.2834, pruned_loss=0.0713, over 21665.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3051, pruned_loss=0.07478, over 4275565.78 frames. ], batch size: 298, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:33:26,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1168302.0, ans=0.1 2023-06-24 19:34:21,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1168422.0, ans=0.1 2023-06-24 19:35:13,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1168602.0, ans=0.125 2023-06-24 19:35:14,538 INFO [train.py:996] (2/4) Epoch 7, batch 11800, loss[loss=0.2345, simple_loss=0.3061, pruned_loss=0.08147, over 21757.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3052, pruned_loss=0.07698, over 4279636.26 frames. 
], batch size: 332, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:36:03,569 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 2.959e+02 3.685e+02 4.448e+02 7.783e+02, threshold=7.370e+02, percent-clipped=3.0 2023-06-24 19:36:16,428 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=15.0 2023-06-24 19:36:28,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1168782.0, ans=0.0 2023-06-24 19:36:41,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1168842.0, ans=0.0 2023-06-24 19:36:59,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1168842.0, ans=0.1 2023-06-24 19:37:04,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1168902.0, ans=0.125 2023-06-24 19:37:05,812 INFO [train.py:996] (2/4) Epoch 7, batch 11850, loss[loss=0.1925, simple_loss=0.265, pruned_loss=0.05999, over 16128.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3073, pruned_loss=0.07598, over 4269454.35 frames. ], batch size: 60, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:38:48,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1169142.0, ans=0.0 2023-06-24 19:38:50,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1169142.0, ans=0.0 2023-06-24 19:38:54,732 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.54 vs. limit=15.0 2023-06-24 19:39:02,914 INFO [train.py:996] (2/4) Epoch 7, batch 11900, loss[loss=0.2119, simple_loss=0.3032, pruned_loss=0.06027, over 21581.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3081, pruned_loss=0.07368, over 4275013.19 frames. ], batch size: 389, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:39:23,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1169202.0, ans=0.1 2023-06-24 19:39:51,156 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.045e+02 2.709e+02 3.163e+02 3.879e+02 8.042e+02, threshold=6.325e+02, percent-clipped=1.0 2023-06-24 19:40:58,223 INFO [train.py:996] (2/4) Epoch 7, batch 11950, loss[loss=0.2097, simple_loss=0.307, pruned_loss=0.05618, over 21637.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3096, pruned_loss=0.07109, over 4273088.30 frames. ], batch size: 441, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:41:05,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1169502.0, ans=0.1 2023-06-24 19:41:28,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1169562.0, ans=0.125 2023-06-24 19:41:51,155 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. 
limit=6.0 2023-06-24 19:41:59,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1169682.0, ans=0.09899494936611666 2023-06-24 19:42:17,388 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.14 vs. limit=15.0 2023-06-24 19:42:18,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1169742.0, ans=0.05 2023-06-24 19:42:40,557 INFO [train.py:996] (2/4) Epoch 7, batch 12000, loss[loss=0.2095, simple_loss=0.2757, pruned_loss=0.07166, over 21853.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.3015, pruned_loss=0.06905, over 4271656.65 frames. ], batch size: 107, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 19:42:40,557 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 19:43:01,776 INFO [train.py:1028] (2/4) Epoch 7, validation: loss=0.261, simple_loss=0.3543, pruned_loss=0.08379, over 1796401.00 frames. 2023-06-24 19:43:01,778 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-24 19:43:20,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1169802.0, ans=0.125 2023-06-24 19:43:28,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1169862.0, ans=0.0 2023-06-24 19:43:44,293 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 2.826e+02 3.232e+02 4.022e+02 5.951e+02, threshold=6.465e+02, percent-clipped=0.0 2023-06-24 19:44:11,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1169982.0, ans=0.0 2023-06-24 19:44:46,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1170042.0, ans=0.125 2023-06-24 19:44:56,198 INFO [train.py:996] (2/4) Epoch 7, batch 12050, loss[loss=0.2474, simple_loss=0.3454, pruned_loss=0.07476, over 19784.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2977, pruned_loss=0.07067, over 4269430.25 frames. ], batch size: 702, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:45:55,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1170282.0, ans=0.0 2023-06-24 19:46:34,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1170342.0, ans=0.0 2023-06-24 19:46:48,314 INFO [train.py:996] (2/4) Epoch 7, batch 12100, loss[loss=0.2556, simple_loss=0.336, pruned_loss=0.08759, over 21617.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3017, pruned_loss=0.0747, over 4276496.54 frames. 
], batch size: 389, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:47:33,917 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.129e+02 3.018e+02 3.555e+02 4.988e+02 8.352e+02, threshold=7.110e+02, percent-clipped=5.0 2023-06-24 19:47:43,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1170522.0, ans=0.2 2023-06-24 19:47:51,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1170522.0, ans=0.0 2023-06-24 19:47:55,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1170582.0, ans=0.5 2023-06-24 19:47:55,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1170582.0, ans=0.2 2023-06-24 19:48:25,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1170642.0, ans=0.1 2023-06-24 19:48:28,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1170642.0, ans=0.125 2023-06-24 19:48:41,147 INFO [train.py:996] (2/4) Epoch 7, batch 12150, loss[loss=0.2137, simple_loss=0.2984, pruned_loss=0.0645, over 21542.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3057, pruned_loss=0.07435, over 4272128.82 frames. ], batch size: 230, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:48:41,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1170702.0, ans=0.2 2023-06-24 19:48:56,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1170702.0, ans=0.1 2023-06-24 19:48:57,979 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 19:49:12,146 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.14 vs. limit=22.5 2023-06-24 19:49:40,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1170822.0, ans=0.0 2023-06-24 19:49:51,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1170882.0, ans=0.125 2023-06-24 19:50:03,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1170882.0, ans=0.1 2023-06-24 19:50:23,862 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.97 vs. limit=15.0 2023-06-24 19:50:30,887 INFO [train.py:996] (2/4) Epoch 7, batch 12200, loss[loss=0.1854, simple_loss=0.2591, pruned_loss=0.05585, over 21293.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3048, pruned_loss=0.07291, over 4266218.73 frames. 
], batch size: 131, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:51:25,419 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 3.032e+02 3.828e+02 4.856e+02 1.056e+03, threshold=7.657e+02, percent-clipped=7.0 2023-06-24 19:51:39,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1171182.0, ans=10.0 2023-06-24 19:52:13,808 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=22.5 2023-06-24 19:52:18,162 INFO [train.py:996] (2/4) Epoch 7, batch 12250, loss[loss=0.1635, simple_loss=0.2353, pruned_loss=0.04587, over 21158.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2967, pruned_loss=0.06904, over 4273846.34 frames. ], batch size: 143, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:53:31,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1171482.0, ans=0.125 2023-06-24 19:54:06,975 INFO [train.py:996] (2/4) Epoch 7, batch 12300, loss[loss=0.2476, simple_loss=0.3435, pruned_loss=0.07587, over 21636.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2887, pruned_loss=0.0647, over 4267090.69 frames. ], batch size: 389, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:54:56,074 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.637e+02 2.162e+02 2.543e+02 3.041e+02 6.823e+02, threshold=5.086e+02, percent-clipped=0.0 2023-06-24 19:55:31,908 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.89 vs. limit=6.0 2023-06-24 19:55:50,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1171842.0, ans=0.125 2023-06-24 19:55:54,581 INFO [train.py:996] (2/4) Epoch 7, batch 12350, loss[loss=0.2238, simple_loss=0.3009, pruned_loss=0.07338, over 21404.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2937, pruned_loss=0.06664, over 4273980.95 frames. ], batch size: 211, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:56:57,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1172022.0, ans=0.0 2023-06-24 19:57:27,850 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.22 vs. limit=10.0 2023-06-24 19:57:35,310 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=15.0 2023-06-24 19:57:40,230 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-24 19:57:42,433 INFO [train.py:996] (2/4) Epoch 7, batch 12400, loss[loss=0.2308, simple_loss=0.3003, pruned_loss=0.08067, over 21388.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2952, pruned_loss=0.06919, over 4271091.86 frames. 
], batch size: 176, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 19:58:09,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1172262.0, ans=0.07 2023-06-24 19:58:22,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1172262.0, ans=0.125 2023-06-24 19:58:37,884 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.337e+02 2.786e+02 3.157e+02 3.873e+02 7.298e+02, threshold=6.314e+02, percent-clipped=10.0 2023-06-24 19:58:58,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1172382.0, ans=0.125 2023-06-24 19:59:28,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1172442.0, ans=0.0 2023-06-24 19:59:33,084 INFO [train.py:996] (2/4) Epoch 7, batch 12450, loss[loss=0.1702, simple_loss=0.2571, pruned_loss=0.04168, over 17064.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2993, pruned_loss=0.07284, over 4270551.61 frames. ], batch size: 60, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 19:59:33,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1172502.0, ans=0.125 2023-06-24 19:59:34,315 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=22.5 2023-06-24 20:00:21,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1172562.0, ans=0.125 2023-06-24 20:01:17,090 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.61 vs. limit=15.0 2023-06-24 20:01:21,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1172742.0, ans=0.0 2023-06-24 20:01:30,062 INFO [train.py:996] (2/4) Epoch 7, batch 12500, loss[loss=0.254, simple_loss=0.3506, pruned_loss=0.07872, over 21776.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3106, pruned_loss=0.0762, over 4277529.71 frames. ], batch size: 282, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 20:01:43,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1172802.0, ans=0.0 2023-06-24 20:01:53,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1172802.0, ans=0.0 2023-06-24 20:01:55,777 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.39 vs. 
limit=15.0 2023-06-24 20:02:24,571 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.454e+02 3.093e+02 3.470e+02 4.423e+02 7.018e+02, threshold=6.940e+02, percent-clipped=1.0 2023-06-24 20:03:19,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1173042.0, ans=0.0 2023-06-24 20:03:26,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1173042.0, ans=0.125 2023-06-24 20:03:30,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1173102.0, ans=0.125 2023-06-24 20:03:31,034 INFO [train.py:996] (2/4) Epoch 7, batch 12550, loss[loss=0.2543, simple_loss=0.3315, pruned_loss=0.08857, over 21843.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3133, pruned_loss=0.07727, over 4276253.24 frames. ], batch size: 118, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 20:03:39,548 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.63 vs. limit=10.0 2023-06-24 20:04:04,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1173162.0, ans=0.125 2023-06-24 20:04:17,526 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.80 vs. limit=12.0 2023-06-24 20:04:40,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1173282.0, ans=0.125 2023-06-24 20:05:21,023 INFO [train.py:996] (2/4) Epoch 7, batch 12600, loss[loss=0.1587, simple_loss=0.2326, pruned_loss=0.04238, over 21860.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.312, pruned_loss=0.07497, over 4279862.24 frames. ], batch size: 98, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 20:06:05,569 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.821e+02 3.460e+02 4.531e+02 8.641e+02, threshold=6.920e+02, percent-clipped=2.0 2023-06-24 20:06:07,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1173522.0, ans=0.125 2023-06-24 20:06:59,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1173642.0, ans=0.125 2023-06-24 20:07:13,636 INFO [train.py:996] (2/4) Epoch 7, batch 12650, loss[loss=0.2129, simple_loss=0.2804, pruned_loss=0.07275, over 21774.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3047, pruned_loss=0.07191, over 4282036.38 frames. ], batch size: 247, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 20:07:30,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1173762.0, ans=0.1 2023-06-24 20:07:43,507 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-24 20:08:15,179 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.23 vs. 
limit=15.0 2023-06-24 20:08:28,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1173882.0, ans=0.0 2023-06-24 20:08:40,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1173942.0, ans=0.0 2023-06-24 20:08:43,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1173942.0, ans=0.125 2023-06-24 20:09:02,179 INFO [train.py:996] (2/4) Epoch 7, batch 12700, loss[loss=0.2041, simple_loss=0.2852, pruned_loss=0.06152, over 21002.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3037, pruned_loss=0.07379, over 4284378.64 frames. ], batch size: 608, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 20:09:47,817 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.268e+02 2.796e+02 3.277e+02 3.938e+02 5.852e+02, threshold=6.553e+02, percent-clipped=0.0 2023-06-24 20:10:41,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1174242.0, ans=0.2 2023-06-24 20:10:44,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1174242.0, ans=0.1 2023-06-24 20:10:50,771 INFO [train.py:996] (2/4) Epoch 7, batch 12750, loss[loss=0.2397, simple_loss=0.3182, pruned_loss=0.08063, over 21843.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.305, pruned_loss=0.07415, over 4284016.91 frames. ], batch size: 118, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 20:11:26,550 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-06-24 20:11:36,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1174422.0, ans=0.07 2023-06-24 20:12:36,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1174542.0, ans=0.1 2023-06-24 20:12:39,037 INFO [train.py:996] (2/4) Epoch 7, batch 12800, loss[loss=0.2326, simple_loss=0.301, pruned_loss=0.08211, over 20092.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3043, pruned_loss=0.07454, over 4285312.52 frames. ], batch size: 704, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 20:12:45,880 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=15.0 2023-06-24 20:12:55,889 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.97 vs. limit=15.0 2023-06-24 20:13:13,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1174662.0, ans=0.05 2023-06-24 20:13:17,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1174662.0, ans=0.2 2023-06-24 20:13:23,129 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.86 vs. 
limit=15.0 2023-06-24 20:13:25,312 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.978e+02 3.549e+02 4.677e+02 8.571e+02, threshold=7.098e+02, percent-clipped=5.0 2023-06-24 20:14:09,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1174842.0, ans=0.125 2023-06-24 20:14:25,210 INFO [train.py:996] (2/4) Epoch 7, batch 12850, loss[loss=0.2481, simple_loss=0.3209, pruned_loss=0.0877, over 21349.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3063, pruned_loss=0.07522, over 4285309.12 frames. ], batch size: 548, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 20:15:17,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1175022.0, ans=0.0 2023-06-24 20:15:28,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1175022.0, ans=0.0 2023-06-24 20:16:16,222 INFO [train.py:996] (2/4) Epoch 7, batch 12900, loss[loss=0.2344, simple_loss=0.3274, pruned_loss=0.07077, over 21644.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.303, pruned_loss=0.07163, over 4279577.90 frames. ], batch size: 414, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 20:16:41,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1175262.0, ans=0.125 2023-06-24 20:17:14,885 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 2.556e+02 2.922e+02 3.625e+02 8.221e+02, threshold=5.845e+02, percent-clipped=4.0 2023-06-24 20:17:31,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1175382.0, ans=0.125 2023-06-24 20:18:05,567 INFO [train.py:996] (2/4) Epoch 7, batch 12950, loss[loss=0.1887, simple_loss=0.2673, pruned_loss=0.05506, over 21385.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3031, pruned_loss=0.07093, over 4277834.19 frames. ], batch size: 194, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:18:06,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1175502.0, ans=0.0 2023-06-24 20:18:08,383 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=22.5 2023-06-24 20:18:13,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1175502.0, ans=0.125 2023-06-24 20:19:24,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1175682.0, ans=0.0 2023-06-24 20:19:53,395 INFO [train.py:996] (2/4) Epoch 7, batch 13000, loss[loss=0.1329, simple_loss=0.2001, pruned_loss=0.0328, over 21814.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3023, pruned_loss=0.07137, over 4280690.53 frames. ], batch size: 98, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:20:50,813 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.916e+02 2.748e+02 3.242e+02 4.275e+02 7.846e+02, threshold=6.485e+02, percent-clipped=8.0 2023-06-24 20:21:12,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1175982.0, ans=0.125 2023-06-24 20:21:43,446 INFO [train.py:996] (2/4) Epoch 7, batch 13050, loss[loss=0.2026, simple_loss=0.2731, pruned_loss=0.06607, over 21850.00 frames. 
], tot_loss[loss=0.2197, simple_loss=0.2993, pruned_loss=0.07008, over 4283624.44 frames. ], batch size: 298, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:22:40,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1176222.0, ans=0.125 2023-06-24 20:23:31,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1176402.0, ans=0.0 2023-06-24 20:23:32,696 INFO [train.py:996] (2/4) Epoch 7, batch 13100, loss[loss=0.2806, simple_loss=0.4116, pruned_loss=0.07481, over 19755.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3016, pruned_loss=0.07048, over 4281106.09 frames. ], batch size: 702, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:23:57,841 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=22.5 2023-06-24 20:24:01,063 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.56 vs. limit=6.0 2023-06-24 20:24:16,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1176462.0, ans=0.125 2023-06-24 20:24:21,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1176462.0, ans=0.0 2023-06-24 20:24:30,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1176522.0, ans=0.1 2023-06-24 20:24:31,409 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.745e+02 3.057e+02 3.676e+02 6.184e+02, threshold=6.113e+02, percent-clipped=0.0 2023-06-24 20:25:20,694 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-24 20:25:33,939 INFO [train.py:996] (2/4) Epoch 7, batch 13150, loss[loss=0.1812, simple_loss=0.2496, pruned_loss=0.05636, over 21290.00 frames. ], tot_loss[loss=0.225, simple_loss=0.304, pruned_loss=0.07296, over 4278826.40 frames. ], batch size: 159, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:26:13,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1176822.0, ans=0.125 2023-06-24 20:26:14,063 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.81 vs. limit=15.0 2023-06-24 20:26:59,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1176942.0, ans=0.125 2023-06-24 20:27:28,461 INFO [train.py:996] (2/4) Epoch 7, batch 13200, loss[loss=0.2426, simple_loss=0.3136, pruned_loss=0.08577, over 21618.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3033, pruned_loss=0.07307, over 4276792.68 frames. 
], batch size: 230, lr: 4.33e-03, grad_scale: 32.0 2023-06-24 20:28:09,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1177122.0, ans=0.125 2023-06-24 20:28:17,667 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.491e+02 2.990e+02 3.679e+02 4.765e+02 8.248e+02, threshold=7.359e+02, percent-clipped=11.0 2023-06-24 20:29:08,905 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.34 vs. limit=15.0 2023-06-24 20:29:18,289 INFO [train.py:996] (2/4) Epoch 7, batch 13250, loss[loss=0.1947, simple_loss=0.2836, pruned_loss=0.05287, over 21638.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3027, pruned_loss=0.07461, over 4273103.59 frames. ], batch size: 230, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:29:26,577 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.08 vs. limit=15.0 2023-06-24 20:29:55,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1177362.0, ans=0.125 2023-06-24 20:30:05,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1177422.0, ans=0.07 2023-06-24 20:30:21,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1177482.0, ans=0.0 2023-06-24 20:30:36,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1177482.0, ans=0.125 2023-06-24 20:30:56,402 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.16 vs. limit=15.0 2023-06-24 20:30:56,477 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.68 vs. limit=22.5 2023-06-24 20:31:09,744 INFO [train.py:996] (2/4) Epoch 7, batch 13300, loss[loss=0.2479, simple_loss=0.3272, pruned_loss=0.08429, over 21776.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3054, pruned_loss=0.07341, over 4272681.32 frames. ], batch size: 124, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:31:44,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1177662.0, ans=0.05 2023-06-24 20:32:10,688 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.218e+02 2.867e+02 3.500e+02 4.353e+02 7.353e+02, threshold=7.001e+02, percent-clipped=0.0 2023-06-24 20:32:18,892 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.12 vs. limit=15.0 2023-06-24 20:32:37,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1177782.0, ans=0.0 2023-06-24 20:32:52,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1177842.0, ans=0.125 2023-06-24 20:33:00,261 INFO [train.py:996] (2/4) Epoch 7, batch 13350, loss[loss=0.2415, simple_loss=0.327, pruned_loss=0.07793, over 21799.00 frames. ], tot_loss[loss=0.23, simple_loss=0.309, pruned_loss=0.07555, over 4277036.37 frames. 
], batch size: 282, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:34:27,890 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-24 20:34:48,882 INFO [train.py:996] (2/4) Epoch 7, batch 13400, loss[loss=0.2253, simple_loss=0.287, pruned_loss=0.08182, over 21417.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3104, pruned_loss=0.07702, over 4280733.32 frames. ], batch size: 211, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:34:49,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1178202.0, ans=0.1 2023-06-24 20:35:02,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1178202.0, ans=0.125 2023-06-24 20:35:25,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1178262.0, ans=0.04949747468305833 2023-06-24 20:35:27,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1178262.0, ans=0.0 2023-06-24 20:35:54,573 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 2.870e+02 3.236e+02 3.893e+02 7.079e+02, threshold=6.472e+02, percent-clipped=1.0 2023-06-24 20:36:00,048 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-06-24 20:36:14,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1178382.0, ans=0.125 2023-06-24 20:36:35,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1178442.0, ans=0.05 2023-06-24 20:36:43,529 INFO [train.py:996] (2/4) Epoch 7, batch 13450, loss[loss=0.2052, simple_loss=0.2861, pruned_loss=0.06211, over 20693.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3105, pruned_loss=0.07915, over 4274460.26 frames. ], batch size: 607, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:37:19,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1178562.0, ans=0.125 2023-06-24 20:37:42,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1178622.0, ans=0.0 2023-06-24 20:38:28,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1178742.0, ans=0.125 2023-06-24 20:38:33,355 INFO [train.py:996] (2/4) Epoch 7, batch 13500, loss[loss=0.2077, simple_loss=0.2926, pruned_loss=0.06136, over 21221.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3028, pruned_loss=0.07731, over 4273017.72 frames. 
], batch size: 548, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:39:20,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1178862.0, ans=0.1 2023-06-24 20:39:21,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1178862.0, ans=0.2 2023-06-24 20:39:33,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1178922.0, ans=0.0 2023-06-24 20:39:35,910 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.421e+02 3.362e+02 3.847e+02 4.790e+02 7.815e+02, threshold=7.695e+02, percent-clipped=4.0 2023-06-24 20:39:38,589 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.07 vs. limit=12.0 2023-06-24 20:39:39,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1178922.0, ans=0.125 2023-06-24 20:39:47,678 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=22.5 2023-06-24 20:40:09,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1179042.0, ans=0.0 2023-06-24 20:40:30,488 INFO [train.py:996] (2/4) Epoch 7, batch 13550, loss[loss=0.2434, simple_loss=0.3317, pruned_loss=0.07753, over 21268.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3076, pruned_loss=0.07674, over 4274525.14 frames. ], batch size: 176, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:40:39,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1179102.0, ans=0.125 2023-06-24 20:41:01,105 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=22.5 2023-06-24 20:41:06,971 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.77 vs. limit=15.0 2023-06-24 20:42:10,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1179342.0, ans=0.2 2023-06-24 20:42:19,508 INFO [train.py:996] (2/4) Epoch 7, batch 13600, loss[loss=0.2396, simple_loss=0.3073, pruned_loss=0.08597, over 21581.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3085, pruned_loss=0.07706, over 4274330.36 frames. ], batch size: 548, lr: 4.33e-03, grad_scale: 32.0 2023-06-24 20:42:45,046 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 20:43:04,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1179522.0, ans=0.2 2023-06-24 20:43:13,907 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.974e+02 2.752e+02 3.319e+02 4.170e+02 8.424e+02, threshold=6.637e+02, percent-clipped=2.0 2023-06-24 20:43:16,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1179522.0, ans=0.2 2023-06-24 20:43:20,717 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.74 vs. 
limit=15.0 2023-06-24 20:43:35,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1179582.0, ans=0.125 2023-06-24 20:43:41,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1179582.0, ans=0.125 2023-06-24 20:44:13,963 INFO [train.py:996] (2/4) Epoch 7, batch 13650, loss[loss=0.198, simple_loss=0.2651, pruned_loss=0.06541, over 21876.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3046, pruned_loss=0.07437, over 4266838.63 frames. ], batch size: 107, lr: 4.33e-03, grad_scale: 32.0 2023-06-24 20:44:38,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1179762.0, ans=0.125 2023-06-24 20:45:09,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1179882.0, ans=0.0 2023-06-24 20:46:02,772 INFO [train.py:996] (2/4) Epoch 7, batch 13700, loss[loss=0.2081, simple_loss=0.2819, pruned_loss=0.06715, over 21631.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2974, pruned_loss=0.07355, over 4258329.45 frames. ], batch size: 263, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:46:08,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1180002.0, ans=0.125 2023-06-24 20:46:40,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1180062.0, ans=0.2 2023-06-24 20:46:53,961 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.933e+02 3.408e+02 4.386e+02 8.480e+02, threshold=6.816e+02, percent-clipped=3.0 2023-06-24 20:47:10,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1180182.0, ans=0.0 2023-06-24 20:47:24,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1180182.0, ans=0.0 2023-06-24 20:47:32,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1180242.0, ans=0.0 2023-06-24 20:47:49,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1180242.0, ans=0.04949747468305833 2023-06-24 20:47:58,408 INFO [train.py:996] (2/4) Epoch 7, batch 13750, loss[loss=0.1674, simple_loss=0.2209, pruned_loss=0.05693, over 21790.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2937, pruned_loss=0.07254, over 4264356.33 frames. 
], batch size: 102, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:48:08,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1180302.0, ans=0.125 2023-06-24 20:48:38,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1180422.0, ans=0.1 2023-06-24 20:49:22,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1180482.0, ans=0.0 2023-06-24 20:49:39,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1180542.0, ans=0.1 2023-06-24 20:49:45,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1180542.0, ans=0.1 2023-06-24 20:49:51,697 INFO [train.py:996] (2/4) Epoch 7, batch 13800, loss[loss=0.2188, simple_loss=0.2956, pruned_loss=0.07098, over 21134.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3025, pruned_loss=0.07205, over 4259708.66 frames. ], batch size: 607, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:50:16,348 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.53 vs. limit=12.0 2023-06-24 20:50:55,004 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 2.819e+02 3.661e+02 5.277e+02 1.106e+03, threshold=7.321e+02, percent-clipped=8.0 2023-06-24 20:51:07,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1180782.0, ans=0.125 2023-06-24 20:51:25,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1180842.0, ans=0.125 2023-06-24 20:51:42,293 INFO [train.py:996] (2/4) Epoch 7, batch 13850, loss[loss=0.2881, simple_loss=0.3625, pruned_loss=0.1068, over 21722.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3068, pruned_loss=0.07301, over 4262201.95 frames. ], batch size: 441, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 20:52:29,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1181022.0, ans=0.1 2023-06-24 20:52:31,387 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.39 vs. limit=15.0 2023-06-24 20:52:46,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1181022.0, ans=0.05 2023-06-24 20:52:49,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1181022.0, ans=0.125 2023-06-24 20:52:58,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1181082.0, ans=0.125 2023-06-24 20:53:14,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1181142.0, ans=0.1 2023-06-24 20:53:33,227 INFO [train.py:996] (2/4) Epoch 7, batch 13900, loss[loss=0.2166, simple_loss=0.2895, pruned_loss=0.07186, over 21806.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.31, pruned_loss=0.07559, over 4261567.47 frames. 
], batch size: 298, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 20:53:34,262 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.37 vs. limit=22.5 2023-06-24 20:53:56,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1181262.0, ans=0.2 2023-06-24 20:54:28,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1181322.0, ans=0.2 2023-06-24 20:54:34,867 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.286e+02 3.148e+02 3.792e+02 4.891e+02 9.530e+02, threshold=7.583e+02, percent-clipped=4.0 2023-06-24 20:54:42,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1181322.0, ans=0.125 2023-06-24 20:55:08,644 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.94 vs. limit=15.0 2023-06-24 20:55:22,196 INFO [train.py:996] (2/4) Epoch 7, batch 13950, loss[loss=0.2286, simple_loss=0.2976, pruned_loss=0.07982, over 21688.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3111, pruned_loss=0.07723, over 4272860.59 frames. ], batch size: 263, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 20:55:56,918 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-24 20:56:13,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1181622.0, ans=0.125 2023-06-24 20:56:35,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1181682.0, ans=0.1 2023-06-24 20:56:42,928 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0 2023-06-24 20:56:53,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1181742.0, ans=0.125 2023-06-24 20:57:06,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1181742.0, ans=0.0 2023-06-24 20:57:09,154 INFO [train.py:996] (2/4) Epoch 7, batch 14000, loss[loss=0.2048, simple_loss=0.3059, pruned_loss=0.05188, over 21597.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3078, pruned_loss=0.07511, over 4258083.08 frames. ], batch size: 230, lr: 4.32e-03, grad_scale: 32.0 2023-06-24 20:57:39,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1181862.0, ans=0.125 2023-06-24 20:58:01,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1181922.0, ans=0.2 2023-06-24 20:58:14,896 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.195e+02 2.939e+02 3.299e+02 3.866e+02 1.368e+03, threshold=6.598e+02, percent-clipped=4.0 2023-06-24 20:58:27,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1181982.0, ans=0.05 2023-06-24 20:58:56,619 INFO [train.py:996] (2/4) Epoch 7, batch 14050, loss[loss=0.1886, simple_loss=0.2492, pruned_loss=0.064, over 21480.00 frames. 
], tot_loss[loss=0.2228, simple_loss=0.3022, pruned_loss=0.07166, over 4264296.39 frames. ], batch size: 195, lr: 4.32e-03, grad_scale: 32.0 2023-06-24 21:00:00,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1182222.0, ans=0.09899494936611666 2023-06-24 21:00:20,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1182282.0, ans=0.0 2023-06-24 21:00:43,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1182402.0, ans=0.035 2023-06-24 21:00:43,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1182402.0, ans=0.125 2023-06-24 21:00:44,831 INFO [train.py:996] (2/4) Epoch 7, batch 14100, loss[loss=0.2235, simple_loss=0.2972, pruned_loss=0.07489, over 21466.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2959, pruned_loss=0.07142, over 4264708.39 frames. ], batch size: 194, lr: 4.32e-03, grad_scale: 32.0 2023-06-24 21:01:47,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1182522.0, ans=0.0 2023-06-24 21:01:49,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1182522.0, ans=0.125 2023-06-24 21:01:52,078 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.852e+02 2.658e+02 3.185e+02 3.657e+02 7.559e+02, threshold=6.369e+02, percent-clipped=1.0 2023-06-24 21:02:18,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1182642.0, ans=0.125 2023-06-24 21:02:28,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1182702.0, ans=0.0 2023-06-24 21:02:29,774 INFO [train.py:996] (2/4) Epoch 7, batch 14150, loss[loss=0.2148, simple_loss=0.3049, pruned_loss=0.06232, over 21771.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2991, pruned_loss=0.07273, over 4270829.23 frames. ], batch size: 118, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:02:33,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1182702.0, ans=0.125 2023-06-24 21:02:40,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1182702.0, ans=0.1 2023-06-24 21:02:59,853 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.90 vs. limit=10.0 2023-06-24 21:03:57,014 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.54 vs. limit=22.5 2023-06-24 21:04:14,284 INFO [train.py:996] (2/4) Epoch 7, batch 14200, loss[loss=0.208, simple_loss=0.2779, pruned_loss=0.06906, over 21654.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2985, pruned_loss=0.07123, over 4259606.47 frames. ], batch size: 332, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:04:15,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1183002.0, ans=0.125 2023-06-24 21:04:33,793 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.64 vs. 
limit=15.0 2023-06-24 21:05:17,833 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.744e+02 2.638e+02 3.052e+02 3.885e+02 7.622e+02, threshold=6.105e+02, percent-clipped=2.0 2023-06-24 21:05:35,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1183182.0, ans=0.125 2023-06-24 21:06:03,296 INFO [train.py:996] (2/4) Epoch 7, batch 14250, loss[loss=0.2308, simple_loss=0.302, pruned_loss=0.07982, over 21505.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2932, pruned_loss=0.07113, over 4255904.65 frames. ], batch size: 509, lr: 4.32e-03, grad_scale: 8.0 2023-06-24 21:06:33,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1183362.0, ans=0.0 2023-06-24 21:06:39,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1183362.0, ans=0.125 2023-06-24 21:07:33,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1183542.0, ans=0.1 2023-06-24 21:07:45,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1183542.0, ans=0.125 2023-06-24 21:07:51,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1183602.0, ans=0.0 2023-06-24 21:07:51,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1183602.0, ans=0.125 2023-06-24 21:07:52,514 INFO [train.py:996] (2/4) Epoch 7, batch 14300, loss[loss=0.1657, simple_loss=0.2564, pruned_loss=0.03752, over 21165.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2936, pruned_loss=0.0704, over 4246130.76 frames. ], batch size: 548, lr: 4.32e-03, grad_scale: 8.0 2023-06-24 21:08:56,719 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 2.824e+02 3.398e+02 4.914e+02 1.429e+03, threshold=6.796e+02, percent-clipped=17.0 2023-06-24 21:09:10,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=1183782.0, ans=0.02 2023-06-24 21:09:25,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1183842.0, ans=0.125 2023-06-24 21:09:32,184 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:09:40,495 INFO [train.py:996] (2/4) Epoch 7, batch 14350, loss[loss=0.2297, simple_loss=0.3125, pruned_loss=0.07342, over 21769.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2962, pruned_loss=0.07003, over 4229348.37 frames. ], batch size: 441, lr: 4.32e-03, grad_scale: 8.0 2023-06-24 21:11:25,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1184142.0, ans=0.0 2023-06-24 21:11:28,492 INFO [train.py:996] (2/4) Epoch 7, batch 14400, loss[loss=0.2322, simple_loss=0.2912, pruned_loss=0.08659, over 21998.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2956, pruned_loss=0.07116, over 4243753.23 frames. 
], batch size: 103, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:11:43,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1184202.0, ans=10.0 2023-06-24 21:12:04,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1184262.0, ans=0.125 2023-06-24 21:12:06,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1184322.0, ans=0.125 2023-06-24 21:12:13,550 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.86 vs. limit=10.0 2023-06-24 21:12:32,413 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.835e+02 3.374e+02 4.163e+02 7.231e+02, threshold=6.749e+02, percent-clipped=2.0 2023-06-24 21:12:51,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1184382.0, ans=0.1 2023-06-24 21:13:14,850 INFO [train.py:996] (2/4) Epoch 7, batch 14450, loss[loss=0.1976, simple_loss=0.2622, pruned_loss=0.06646, over 21829.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2915, pruned_loss=0.07177, over 4256995.10 frames. ], batch size: 283, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:13:27,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1184502.0, ans=0.125 2023-06-24 21:14:37,905 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=12.0 2023-06-24 21:14:58,800 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.39 vs. limit=12.0 2023-06-24 21:15:03,209 INFO [train.py:996] (2/4) Epoch 7, batch 14500, loss[loss=0.239, simple_loss=0.3171, pruned_loss=0.08044, over 21553.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2896, pruned_loss=0.07168, over 4266237.51 frames. ], batch size: 441, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:16:08,442 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 2.747e+02 3.186e+02 4.190e+02 7.871e+02, threshold=6.373e+02, percent-clipped=3.0 2023-06-24 21:16:47,453 INFO [train.py:996] (2/4) Epoch 7, batch 14550, loss[loss=0.2459, simple_loss=0.3249, pruned_loss=0.08347, over 21388.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2951, pruned_loss=0.07276, over 4269669.88 frames. ], batch size: 176, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:18:34,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1185342.0, ans=0.2 2023-06-24 21:18:37,540 INFO [train.py:996] (2/4) Epoch 7, batch 14600, loss[loss=0.2011, simple_loss=0.296, pruned_loss=0.05307, over 19706.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3024, pruned_loss=0.07677, over 4271888.68 frames. ], batch size: 702, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:18:45,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1185402.0, ans=0.0 2023-06-24 21:19:04,841 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. 
limit=6.0 2023-06-24 21:19:25,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1185462.0, ans=0.0 2023-06-24 21:19:27,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1185522.0, ans=0.1 2023-06-24 21:19:42,438 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.336e+02 3.108e+02 3.903e+02 5.552e+02 1.166e+03, threshold=7.806e+02, percent-clipped=17.0 2023-06-24 21:20:20,974 INFO [train.py:996] (2/4) Epoch 7, batch 14650, loss[loss=0.2578, simple_loss=0.3559, pruned_loss=0.07984, over 21209.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3045, pruned_loss=0.07607, over 4262825.43 frames. ], batch size: 548, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:21:06,109 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=12.0 2023-06-24 21:21:31,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1185882.0, ans=0.125 2023-06-24 21:21:51,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1185942.0, ans=0.125 2023-06-24 21:21:56,153 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.82 vs. limit=15.0 2023-06-24 21:22:00,327 INFO [train.py:996] (2/4) Epoch 7, batch 14700, loss[loss=0.2035, simple_loss=0.2703, pruned_loss=0.06831, over 16287.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2982, pruned_loss=0.07066, over 4252059.79 frames. ], batch size: 61, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:22:17,539 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=22.5 2023-06-24 21:23:06,196 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.798e+02 2.369e+02 2.874e+02 3.417e+02 6.463e+02, threshold=5.748e+02, percent-clipped=0.0 2023-06-24 21:23:09,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1186182.0, ans=0.125 2023-06-24 21:23:19,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1186182.0, ans=0.1 2023-06-24 21:23:37,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1186242.0, ans=0.125 2023-06-24 21:23:51,809 INFO [train.py:996] (2/4) Epoch 7, batch 14750, loss[loss=0.2294, simple_loss=0.3133, pruned_loss=0.07272, over 20974.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3036, pruned_loss=0.07277, over 4253552.05 frames. ], batch size: 608, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:24:19,183 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.16 vs. 
limit=15.0 2023-06-24 21:24:20,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1186362.0, ans=0.125 2023-06-24 21:25:03,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1186482.0, ans=0.2 2023-06-24 21:25:48,302 INFO [train.py:996] (2/4) Epoch 7, batch 14800, loss[loss=0.2059, simple_loss=0.276, pruned_loss=0.06785, over 21831.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3146, pruned_loss=0.07739, over 4259312.74 frames. ], batch size: 118, lr: 4.31e-03, grad_scale: 32.0 2023-06-24 21:25:55,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1186602.0, ans=0.125 2023-06-24 21:26:00,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1186602.0, ans=0.0 2023-06-24 21:26:00,618 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-24 21:26:06,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1186602.0, ans=0.0 2023-06-24 21:26:18,597 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-24 21:26:43,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1186722.0, ans=0.125 2023-06-24 21:26:44,096 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.322e+02 3.322e+02 4.309e+02 5.612e+02 1.041e+03, threshold=8.619e+02, percent-clipped=22.0 2023-06-24 21:26:46,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1186782.0, ans=0.1 2023-06-24 21:26:47,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1186782.0, ans=0.0 2023-06-24 21:27:44,155 INFO [train.py:996] (2/4) Epoch 7, batch 14850, loss[loss=0.2087, simple_loss=0.288, pruned_loss=0.06473, over 20072.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3079, pruned_loss=0.07674, over 4259092.29 frames. ], batch size: 704, lr: 4.31e-03, grad_scale: 32.0 2023-06-24 21:28:12,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1186962.0, ans=0.125 2023-06-24 21:29:30,046 INFO [train.py:996] (2/4) Epoch 7, batch 14900, loss[loss=0.252, simple_loss=0.3223, pruned_loss=0.09085, over 21567.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3107, pruned_loss=0.07918, over 4262697.00 frames. 
], batch size: 230, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:30:12,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1187322.0, ans=0.2 2023-06-24 21:30:36,829 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.129e+02 3.164e+02 3.884e+02 4.869e+02 8.267e+02, threshold=7.767e+02, percent-clipped=0.0 2023-06-24 21:30:57,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1187382.0, ans=0.2 2023-06-24 21:30:59,512 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:31:10,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1187442.0, ans=0.05 2023-06-24 21:31:13,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1187442.0, ans=0.0 2023-06-24 21:31:20,160 INFO [train.py:996] (2/4) Epoch 7, batch 14950, loss[loss=0.2455, simple_loss=0.329, pruned_loss=0.08104, over 21419.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3105, pruned_loss=0.07809, over 4260321.66 frames. ], batch size: 131, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:31:42,418 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=12.0 2023-06-24 21:32:50,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1187742.0, ans=0.1 2023-06-24 21:33:09,322 INFO [train.py:996] (2/4) Epoch 7, batch 15000, loss[loss=0.2197, simple_loss=0.29, pruned_loss=0.0747, over 19988.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3125, pruned_loss=0.07908, over 4259719.56 frames. ], batch size: 702, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:33:09,323 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 21:33:26,474 INFO [train.py:1028] (2/4) Epoch 7, validation: loss=0.2547, simple_loss=0.3504, pruned_loss=0.07951, over 1796401.00 frames. 2023-06-24 21:33:26,476 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-24 21:34:39,135 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.086e+02 2.767e+02 3.159e+02 3.696e+02 5.819e+02, threshold=6.318e+02, percent-clipped=0.0 2023-06-24 21:34:52,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1187982.0, ans=0.125 2023-06-24 21:35:12,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1188042.0, ans=0.0 2023-06-24 21:35:17,630 INFO [train.py:996] (2/4) Epoch 7, batch 15050, loss[loss=0.2984, simple_loss=0.3878, pruned_loss=0.1045, over 21578.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3144, pruned_loss=0.08082, over 4256706.59 frames. ], batch size: 471, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:35:23,898 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.59 vs. 
limit=15.0 2023-06-24 21:35:56,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1188162.0, ans=0.0 2023-06-24 21:36:00,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1188162.0, ans=0.125 2023-06-24 21:36:05,018 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-24 21:36:14,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1188222.0, ans=0.2 2023-06-24 21:36:22,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1188222.0, ans=0.1 2023-06-24 21:36:23,683 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.92 vs. limit=15.0 2023-06-24 21:36:28,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1188282.0, ans=0.125 2023-06-24 21:37:07,761 INFO [train.py:996] (2/4) Epoch 7, batch 15100, loss[loss=0.2873, simple_loss=0.3566, pruned_loss=0.109, over 21438.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3163, pruned_loss=0.08068, over 4262888.46 frames. ], batch size: 471, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:37:40,247 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.21 vs. limit=15.0 2023-06-24 21:38:13,590 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.271e+02 2.973e+02 3.589e+02 4.717e+02 7.835e+02, threshold=7.177e+02, percent-clipped=5.0 2023-06-24 21:38:18,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1188582.0, ans=0.0 2023-06-24 21:38:33,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1188642.0, ans=0.125 2023-06-24 21:39:00,096 INFO [train.py:996] (2/4) Epoch 7, batch 15150, loss[loss=0.2321, simple_loss=0.2867, pruned_loss=0.08872, over 21427.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3117, pruned_loss=0.08067, over 4271246.19 frames. ], batch size: 389, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:39:00,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1188702.0, ans=0.1 2023-06-24 21:39:34,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1188762.0, ans=0.0 2023-06-24 21:40:02,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1188882.0, ans=0.1 2023-06-24 21:40:04,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1188882.0, ans=0.0 2023-06-24 21:40:40,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1188942.0, ans=0.125 2023-06-24 21:40:49,638 INFO [train.py:996] (2/4) Epoch 7, batch 15200, loss[loss=0.1918, simple_loss=0.2834, pruned_loss=0.05004, over 21853.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3037, pruned_loss=0.07709, over 4266716.59 frames. 
], batch size: 372, lr: 4.31e-03, grad_scale: 32.0 2023-06-24 21:41:08,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1189002.0, ans=0.125 2023-06-24 21:41:29,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1189062.0, ans=0.125 2023-06-24 21:41:51,480 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.555e+02 2.882e+02 3.442e+02 5.882e+02, threshold=5.763e+02, percent-clipped=0.0 2023-06-24 21:42:03,783 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=22.5 2023-06-24 21:42:49,814 INFO [train.py:996] (2/4) Epoch 7, batch 15250, loss[loss=0.2301, simple_loss=0.2798, pruned_loss=0.09023, over 21317.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2976, pruned_loss=0.07563, over 4262467.12 frames. ], batch size: 473, lr: 4.31e-03, grad_scale: 32.0 2023-06-24 21:42:50,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1189302.0, ans=0.2 2023-06-24 21:43:46,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1189482.0, ans=0.1 2023-06-24 21:43:50,927 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-24 21:44:05,927 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:44:23,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1189542.0, ans=0.0 2023-06-24 21:44:39,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1189602.0, ans=0.0 2023-06-24 21:44:40,026 INFO [train.py:996] (2/4) Epoch 7, batch 15300, loss[loss=0.2763, simple_loss=0.3429, pruned_loss=0.1049, over 21442.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3013, pruned_loss=0.07848, over 4255899.28 frames. ], batch size: 471, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:44:40,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1189602.0, ans=0.125 2023-06-24 21:44:47,830 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.09 vs. limit=15.0 2023-06-24 21:45:37,484 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.236e+02 3.827e+02 4.813e+02 8.149e+02, threshold=7.653e+02, percent-clipped=14.0 2023-06-24 21:45:52,963 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.27 vs. limit=15.0 2023-06-24 21:46:11,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1189842.0, ans=0.0 2023-06-24 21:46:23,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1189842.0, ans=0.125 2023-06-24 21:46:27,876 INFO [train.py:996] (2/4) Epoch 7, batch 15350, loss[loss=0.2415, simple_loss=0.3269, pruned_loss=0.07801, over 21488.00 frames. 
], tot_loss[loss=0.2333, simple_loss=0.3062, pruned_loss=0.0802, over 4262623.35 frames. ], batch size: 211, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:46:29,139 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=15.0 2023-06-24 21:46:35,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1189902.0, ans=0.0 2023-06-24 21:47:07,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1190022.0, ans=0.125 2023-06-24 21:47:09,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1190022.0, ans=0.125 2023-06-24 21:47:09,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1190022.0, ans=0.125 2023-06-24 21:47:56,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1190142.0, ans=0.125 2023-06-24 21:48:14,128 INFO [train.py:996] (2/4) Epoch 7, batch 15400, loss[loss=0.2263, simple_loss=0.3074, pruned_loss=0.07264, over 21843.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3065, pruned_loss=0.07856, over 4275339.62 frames. ], batch size: 124, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:48:50,919 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.72 vs. limit=15.0 2023-06-24 21:48:57,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1190322.0, ans=0.0 2023-06-24 21:49:05,380 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 2.624e+02 3.015e+02 3.662e+02 6.507e+02, threshold=6.030e+02, percent-clipped=0.0 2023-06-24 21:49:52,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1190442.0, ans=0.1 2023-06-24 21:50:02,501 INFO [train.py:996] (2/4) Epoch 7, batch 15450, loss[loss=0.2399, simple_loss=0.324, pruned_loss=0.07789, over 21758.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3046, pruned_loss=0.07775, over 4274532.94 frames. ], batch size: 414, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:50:24,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1190562.0, ans=15.0 2023-06-24 21:50:40,079 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2023-06-24 21:51:46,574 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=22.5 2023-06-24 21:51:49,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1190742.0, ans=0.1 2023-06-24 21:51:52,855 INFO [train.py:996] (2/4) Epoch 7, batch 15500, loss[loss=0.2411, simple_loss=0.3225, pruned_loss=0.07983, over 21588.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3073, pruned_loss=0.07708, over 4252614.67 frames. 
], batch size: 263, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:52:19,099 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.77 vs. limit=10.0 2023-06-24 21:52:36,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1190922.0, ans=0.0 2023-06-24 21:52:51,803 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.076e+02 2.878e+02 3.263e+02 4.056e+02 7.756e+02, threshold=6.526e+02, percent-clipped=2.0 2023-06-24 21:53:01,657 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.44 vs. limit=22.5 2023-06-24 21:53:29,389 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=15.0 2023-06-24 21:53:37,053 INFO [train.py:996] (2/4) Epoch 7, batch 15550, loss[loss=0.1948, simple_loss=0.2789, pruned_loss=0.05534, over 21712.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3057, pruned_loss=0.0741, over 4253662.77 frames. ], batch size: 247, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:53:48,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1191102.0, ans=0.0 2023-06-24 21:55:20,515 INFO [train.py:996] (2/4) Epoch 7, batch 15600, loss[loss=0.1951, simple_loss=0.2574, pruned_loss=0.06636, over 21410.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3019, pruned_loss=0.07278, over 4243742.29 frames. ], batch size: 212, lr: 4.31e-03, grad_scale: 32.0 2023-06-24 21:55:28,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1191402.0, ans=0.1 2023-06-24 21:56:21,705 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=15.0 2023-06-24 21:56:23,751 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.726e+02 3.210e+02 4.134e+02 7.598e+02, threshold=6.420e+02, percent-clipped=3.0 2023-06-24 21:56:29,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1191582.0, ans=0.0 2023-06-24 21:57:09,398 INFO [train.py:996] (2/4) Epoch 7, batch 15650, loss[loss=0.2067, simple_loss=0.2908, pruned_loss=0.0613, over 21590.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3005, pruned_loss=0.07221, over 4234841.72 frames. ], batch size: 263, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 21:57:24,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1191702.0, ans=0.1 2023-06-24 21:57:32,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1191762.0, ans=0.0 2023-06-24 21:58:30,478 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.43 vs. 
limit=15.0 2023-06-24 21:58:36,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1191882.0, ans=0.1 2023-06-24 21:58:42,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1191942.0, ans=0.1 2023-06-24 21:58:57,019 INFO [train.py:996] (2/4) Epoch 7, batch 15700, loss[loss=0.2078, simple_loss=0.2967, pruned_loss=0.05939, over 21860.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2966, pruned_loss=0.07106, over 4236604.92 frames. ], batch size: 372, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 21:59:29,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1192122.0, ans=0.1 2023-06-24 21:59:55,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1192122.0, ans=0.2 2023-06-24 21:59:59,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1192182.0, ans=0.125 2023-06-24 22:00:00,281 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.614e+02 3.168e+02 3.646e+02 5.632e+02, threshold=6.336e+02, percent-clipped=0.0 2023-06-24 22:00:29,140 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=22.5 2023-06-24 22:00:32,274 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-24 22:00:43,478 INFO [train.py:996] (2/4) Epoch 7, batch 15750, loss[loss=0.2144, simple_loss=0.2853, pruned_loss=0.07174, over 21691.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2918, pruned_loss=0.07105, over 4245171.25 frames. ], batch size: 282, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:00:59,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1192362.0, ans=0.125 2023-06-24 22:01:10,715 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=12.0 2023-06-24 22:01:26,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1192422.0, ans=0.04949747468305833 2023-06-24 22:01:43,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1192422.0, ans=0.1 2023-06-24 22:02:11,180 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.82 vs. limit=15.0 2023-06-24 22:02:24,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1192542.0, ans=0.125 2023-06-24 22:02:32,345 INFO [train.py:996] (2/4) Epoch 7, batch 15800, loss[loss=0.2169, simple_loss=0.2859, pruned_loss=0.07395, over 21745.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2869, pruned_loss=0.07088, over 4256002.55 frames. 
], batch size: 351, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:02:43,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1192602.0, ans=0.125 2023-06-24 22:03:37,344 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.306e+02 2.697e+02 3.086e+02 3.699e+02 6.270e+02, threshold=6.172e+02, percent-clipped=0.0 2023-06-24 22:03:49,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1192782.0, ans=0.125 2023-06-24 22:03:51,633 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=22.5 2023-06-24 22:04:15,589 INFO [train.py:996] (2/4) Epoch 7, batch 15850, loss[loss=0.2266, simple_loss=0.2957, pruned_loss=0.07875, over 21902.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2876, pruned_loss=0.07289, over 4260758.16 frames. ], batch size: 317, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:04:36,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1192962.0, ans=0.0 2023-06-24 22:05:57,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1193142.0, ans=0.125 2023-06-24 22:05:58,153 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-24 22:06:04,600 INFO [train.py:996] (2/4) Epoch 7, batch 15900, loss[loss=0.207, simple_loss=0.2816, pruned_loss=0.06623, over 21861.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2846, pruned_loss=0.07245, over 4259033.30 frames. ], batch size: 107, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:06:54,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1193322.0, ans=0.0 2023-06-24 22:07:09,491 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.367e+02 2.996e+02 3.520e+02 4.315e+02 6.246e+02, threshold=7.040e+02, percent-clipped=3.0 2023-06-24 22:07:17,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1193382.0, ans=0.2 2023-06-24 22:07:38,610 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-24 22:07:43,692 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.20 vs. limit=12.0 2023-06-24 22:07:44,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1193442.0, ans=0.0 2023-06-24 22:07:53,117 INFO [train.py:996] (2/4) Epoch 7, batch 15950, loss[loss=0.2519, simple_loss=0.3335, pruned_loss=0.08514, over 21600.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2864, pruned_loss=0.07157, over 4246985.82 frames. 
], batch size: 414, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:07:55,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1193502.0, ans=0.0 2023-06-24 22:07:58,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1193502.0, ans=0.125 2023-06-24 22:08:38,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1193622.0, ans=0.0 2023-06-24 22:09:00,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1193682.0, ans=0.125 2023-06-24 22:09:28,109 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.21 vs. limit=15.0 2023-06-24 22:09:29,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1193742.0, ans=0.0 2023-06-24 22:09:43,002 INFO [train.py:996] (2/4) Epoch 7, batch 16000, loss[loss=0.2065, simple_loss=0.3032, pruned_loss=0.05494, over 21759.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2876, pruned_loss=0.0695, over 4246833.45 frames. ], batch size: 298, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 22:09:47,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1193802.0, ans=0.0 2023-06-24 22:09:52,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1193802.0, ans=0.125 2023-06-24 22:09:56,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1193802.0, ans=0.125 2023-06-24 22:10:03,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1193862.0, ans=0.0 2023-06-24 22:10:19,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1193922.0, ans=0.0 2023-06-24 22:10:55,227 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 3.010e+02 3.950e+02 5.010e+02 9.750e+02, threshold=7.899e+02, percent-clipped=10.0 2023-06-24 22:11:32,526 INFO [train.py:996] (2/4) Epoch 7, batch 16050, loss[loss=0.2153, simple_loss=0.3112, pruned_loss=0.05973, over 21699.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2904, pruned_loss=0.06756, over 4254526.75 frames. ], batch size: 247, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:11:33,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1194102.0, ans=0.2 2023-06-24 22:11:44,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1194102.0, ans=0.0 2023-06-24 22:11:45,359 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.93 vs. limit=6.0 2023-06-24 22:11:57,622 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.21 vs. 
limit=10.0 2023-06-24 22:12:10,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1194222.0, ans=0.125 2023-06-24 22:13:06,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1194342.0, ans=0.0 2023-06-24 22:13:17,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1194342.0, ans=0.125 2023-06-24 22:13:17,811 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.27 vs. limit=15.0 2023-06-24 22:13:19,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1194402.0, ans=0.125 2023-06-24 22:13:20,142 INFO [train.py:996] (2/4) Epoch 7, batch 16100, loss[loss=0.2143, simple_loss=0.2908, pruned_loss=0.06895, over 21861.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2966, pruned_loss=0.06871, over 4258906.51 frames. ], batch size: 298, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:13:27,322 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 22:13:55,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1194522.0, ans=0.2 2023-06-24 22:14:25,476 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 3.065e+02 3.753e+02 4.772e+02 1.110e+03, threshold=7.506e+02, percent-clipped=5.0 2023-06-24 22:14:39,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1194582.0, ans=0.2 2023-06-24 22:15:06,578 INFO [train.py:996] (2/4) Epoch 7, batch 16150, loss[loss=0.2382, simple_loss=0.3104, pruned_loss=0.08298, over 21786.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2981, pruned_loss=0.07143, over 4270187.12 frames. ], batch size: 112, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:15:25,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1194762.0, ans=0.125 2023-06-24 22:15:40,282 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=22.5 2023-06-24 22:16:18,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1194882.0, ans=0.2 2023-06-24 22:16:33,654 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.37 vs. limit=10.0 2023-06-24 22:16:50,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1194942.0, ans=0.95 2023-06-24 22:16:57,007 INFO [train.py:996] (2/4) Epoch 7, batch 16200, loss[loss=0.2084, simple_loss=0.2848, pruned_loss=0.06606, over 21677.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.301, pruned_loss=0.07215, over 4278896.18 frames. 
], batch size: 263, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:17:01,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1195002.0, ans=0.2 2023-06-24 22:18:08,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1195122.0, ans=0.0 2023-06-24 22:18:08,862 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.38 vs. limit=15.0 2023-06-24 22:18:15,480 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.476e+02 2.895e+02 3.394e+02 4.172e+02 8.958e+02, threshold=6.788e+02, percent-clipped=2.0 2023-06-24 22:18:18,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1195182.0, ans=0.0 2023-06-24 22:18:25,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1195182.0, ans=0.0 2023-06-24 22:18:47,719 INFO [train.py:996] (2/4) Epoch 7, batch 16250, loss[loss=0.1803, simple_loss=0.2524, pruned_loss=0.05408, over 21737.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3003, pruned_loss=0.07248, over 4284341.81 frames. ], batch size: 282, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:19:07,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1195362.0, ans=0.04949747468305833 2023-06-24 22:19:39,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1195422.0, ans=0.0 2023-06-24 22:20:07,674 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-06-24 22:20:12,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1195542.0, ans=0.125 2023-06-24 22:20:18,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1195542.0, ans=0.035 2023-06-24 22:20:22,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1195542.0, ans=0.1 2023-06-24 22:20:22,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1195542.0, ans=0.0 2023-06-24 22:20:31,156 INFO [train.py:996] (2/4) Epoch 7, batch 16300, loss[loss=0.1767, simple_loss=0.2676, pruned_loss=0.04289, over 21767.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.294, pruned_loss=0.06908, over 4276473.02 frames. ], batch size: 282, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:21:48,947 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 2.667e+02 3.225e+02 3.668e+02 6.965e+02, threshold=6.450e+02, percent-clipped=1.0 2023-06-24 22:21:54,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1195782.0, ans=0.125 2023-06-24 22:21:55,298 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.01 vs. 
limit=15.0 2023-06-24 22:22:02,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1195842.0, ans=0.125 2023-06-24 22:22:04,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=1195842.0, ans=22.5 2023-06-24 22:22:20,757 INFO [train.py:996] (2/4) Epoch 7, batch 16350, loss[loss=0.2518, simple_loss=0.3206, pruned_loss=0.09148, over 21361.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2955, pruned_loss=0.07041, over 4273921.62 frames. ], batch size: 176, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:23:29,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1196082.0, ans=0.1 2023-06-24 22:24:04,363 INFO [train.py:996] (2/4) Epoch 7, batch 16400, loss[loss=0.2247, simple_loss=0.3122, pruned_loss=0.06856, over 20759.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2979, pruned_loss=0.07148, over 4281986.15 frames. ], batch size: 607, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 22:25:10,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1196322.0, ans=0.2 2023-06-24 22:25:16,391 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.934e+02 3.396e+02 4.473e+02 6.388e+02, threshold=6.793e+02, percent-clipped=0.0 2023-06-24 22:25:18,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1196382.0, ans=0.125 2023-06-24 22:25:39,534 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.19 vs. limit=15.0 2023-06-24 22:25:42,832 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=12.0 2023-06-24 22:25:48,966 INFO [train.py:996] (2/4) Epoch 7, batch 16450, loss[loss=0.2252, simple_loss=0.2969, pruned_loss=0.07677, over 21866.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2987, pruned_loss=0.07274, over 4287534.72 frames. ], batch size: 351, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 22:25:54,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1196502.0, ans=0.2 2023-06-24 22:26:58,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1196682.0, ans=0.1 2023-06-24 22:27:32,606 INFO [train.py:996] (2/4) Epoch 7, batch 16500, loss[loss=0.2754, simple_loss=0.3445, pruned_loss=0.1031, over 21515.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2991, pruned_loss=0.07317, over 4273907.11 frames. ], batch size: 508, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 22:27:57,102 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.07 vs. limit=15.0 2023-06-24 22:28:06,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1196862.0, ans=0.09899494936611666 2023-06-24 22:28:25,211 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.48 vs. 
limit=15.0 2023-06-24 22:28:51,342 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.322e+02 3.249e+02 4.017e+02 5.671e+02 1.121e+03, threshold=8.034e+02, percent-clipped=17.0 2023-06-24 22:28:57,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1196982.0, ans=0.125 2023-06-24 22:29:23,160 INFO [train.py:996] (2/4) Epoch 7, batch 16550, loss[loss=0.2322, simple_loss=0.3064, pruned_loss=0.07896, over 21460.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2962, pruned_loss=0.07103, over 4273288.45 frames. ], batch size: 194, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 22:29:24,450 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.41 vs. limit=22.5 2023-06-24 22:30:03,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1197162.0, ans=0.0 2023-06-24 22:30:28,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1197222.0, ans=0.1 2023-06-24 22:30:41,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1197282.0, ans=0.125 2023-06-24 22:30:46,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1197282.0, ans=0.125 2023-06-24 22:31:37,496 INFO [train.py:996] (2/4) Epoch 7, batch 16600, loss[loss=0.2475, simple_loss=0.3339, pruned_loss=0.08057, over 21310.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3065, pruned_loss=0.07419, over 4272157.22 frames. ], batch size: 176, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 22:31:54,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1197462.0, ans=0.0 2023-06-24 22:31:56,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1197462.0, ans=0.125 2023-06-24 22:32:28,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1197522.0, ans=0.2 2023-06-24 22:32:36,779 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.639e+02 3.261e+02 4.003e+02 5.335e+02 1.096e+03, threshold=8.006e+02, percent-clipped=4.0 2023-06-24 22:32:47,262 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-24 22:33:01,515 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.05 vs. limit=12.0 2023-06-24 22:33:26,380 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 22:33:28,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1197702.0, ans=0.1 2023-06-24 22:33:29,493 INFO [train.py:996] (2/4) Epoch 7, batch 16650, loss[loss=0.2938, simple_loss=0.3603, pruned_loss=0.1136, over 21449.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3169, pruned_loss=0.07735, over 4276018.07 frames. 
], batch size: 471, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 22:33:34,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1197702.0, ans=0.125 2023-06-24 22:33:35,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1197702.0, ans=0.1 2023-06-24 22:34:02,518 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-24 22:34:11,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1197822.0, ans=0.0 2023-06-24 22:34:22,429 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 22:34:29,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1197882.0, ans=0.2 2023-06-24 22:35:05,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1197942.0, ans=0.0 2023-06-24 22:35:17,499 INFO [train.py:996] (2/4) Epoch 7, batch 16700, loss[loss=0.1873, simple_loss=0.256, pruned_loss=0.05932, over 21489.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3176, pruned_loss=0.07859, over 4273004.78 frames. ], batch size: 211, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 22:35:30,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1198002.0, ans=0.125 2023-06-24 22:35:34,690 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.90 vs. limit=15.0 2023-06-24 22:36:37,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1198182.0, ans=0.0 2023-06-24 22:36:39,637 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.570e+02 3.449e+02 4.344e+02 5.804e+02 8.392e+02, threshold=8.689e+02, percent-clipped=2.0 2023-06-24 22:37:12,532 INFO [train.py:996] (2/4) Epoch 7, batch 16750, loss[loss=0.2317, simple_loss=0.291, pruned_loss=0.08622, over 19908.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3186, pruned_loss=0.0801, over 4269311.20 frames. ], batch size: 702, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 22:38:16,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1198422.0, ans=0.0 2023-06-24 22:38:16,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1198422.0, ans=0.1 2023-06-24 22:38:35,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1198482.0, ans=0.125 2023-06-24 22:39:02,820 INFO [train.py:996] (2/4) Epoch 7, batch 16800, loss[loss=0.2479, simple_loss=0.3224, pruned_loss=0.08667, over 21854.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3213, pruned_loss=0.08009, over 4272431.57 frames. 
], batch size: 371, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:39:08,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1198602.0, ans=0.125 2023-06-24 22:39:52,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1198662.0, ans=0.1 2023-06-24 22:40:07,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1198722.0, ans=0.0 2023-06-24 22:40:20,485 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.624e+02 3.537e+02 4.384e+02 6.125e+02 1.119e+03, threshold=8.769e+02, percent-clipped=3.0 2023-06-24 22:40:22,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1198782.0, ans=0.09899494936611666 2023-06-24 22:40:43,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1198842.0, ans=0.125 2023-06-24 22:40:55,242 INFO [train.py:996] (2/4) Epoch 7, batch 16850, loss[loss=0.2342, simple_loss=0.315, pruned_loss=0.07675, over 21846.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3168, pruned_loss=0.0794, over 4283260.39 frames. ], batch size: 124, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:41:04,782 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.00 vs. limit=22.5 2023-06-24 22:41:52,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1199022.0, ans=0.1 2023-06-24 22:41:59,121 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 22:42:01,534 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. limit=10.0 2023-06-24 22:42:47,381 INFO [train.py:996] (2/4) Epoch 7, batch 16900, loss[loss=0.1831, simple_loss=0.2572, pruned_loss=0.05451, over 21501.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3103, pruned_loss=0.07759, over 4288128.79 frames. ], batch size: 212, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:43:07,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1199202.0, ans=0.125 2023-06-24 22:43:27,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1199262.0, ans=0.04949747468305833 2023-06-24 22:43:54,628 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 2.672e+02 3.013e+02 3.696e+02 7.423e+02, threshold=6.025e+02, percent-clipped=0.0 2023-06-24 22:44:12,959 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.66 vs. limit=22.5 2023-06-24 22:44:34,219 INFO [train.py:996] (2/4) Epoch 7, batch 16950, loss[loss=0.2686, simple_loss=0.3123, pruned_loss=0.1124, over 21777.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3036, pruned_loss=0.07628, over 4282146.59 frames. 
], batch size: 508, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:44:41,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1199502.0, ans=0.125 2023-06-24 22:46:21,585 INFO [train.py:996] (2/4) Epoch 7, batch 17000, loss[loss=0.2174, simple_loss=0.2945, pruned_loss=0.07018, over 21885.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3007, pruned_loss=0.07662, over 4292188.11 frames. ], batch size: 332, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:46:49,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1199862.0, ans=0.5 2023-06-24 22:47:33,178 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 3.127e+02 3.708e+02 4.467e+02 7.774e+02, threshold=7.417e+02, percent-clipped=6.0 2023-06-24 22:48:18,539 INFO [train.py:996] (2/4) Epoch 7, batch 17050, loss[loss=0.2542, simple_loss=0.3273, pruned_loss=0.09052, over 21206.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3074, pruned_loss=0.07908, over 4289096.60 frames. ], batch size: 143, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:48:21,451 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=12.0 2023-06-24 22:48:43,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1200162.0, ans=0.0 2023-06-24 22:48:54,873 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.12 vs. limit=22.5 2023-06-24 22:49:43,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1200342.0, ans=0.0 2023-06-24 22:50:04,901 INFO [train.py:996] (2/4) Epoch 7, batch 17100, loss[loss=0.216, simple_loss=0.2853, pruned_loss=0.07337, over 21856.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3077, pruned_loss=0.07956, over 4289026.25 frames. ], batch size: 298, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:51:07,035 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 2.876e+02 3.458e+02 4.009e+02 6.895e+02, threshold=6.917e+02, percent-clipped=0.0 2023-06-24 22:51:09,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1200582.0, ans=0.2 2023-06-24 22:51:46,952 INFO [train.py:996] (2/4) Epoch 7, batch 17150, loss[loss=0.1828, simple_loss=0.2646, pruned_loss=0.05053, over 21807.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3028, pruned_loss=0.07818, over 4292981.48 frames. ], batch size: 351, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:51:47,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1200702.0, ans=0.0 2023-06-24 22:52:14,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1200762.0, ans=0.125 2023-06-24 22:52:28,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1200822.0, ans=0.125 2023-06-24 22:53:14,508 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.54 vs. 
limit=15.0 2023-06-24 22:53:42,036 INFO [train.py:996] (2/4) Epoch 7, batch 17200, loss[loss=0.29, simple_loss=0.3479, pruned_loss=0.1161, over 21415.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3031, pruned_loss=0.07811, over 4293851.69 frames. ], batch size: 471, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:53:48,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1201002.0, ans=0.125 2023-06-24 22:53:51,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1201002.0, ans=0.1 2023-06-24 22:54:20,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1201122.0, ans=10.0 2023-06-24 22:54:53,180 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.812e+02 3.269e+02 4.158e+02 6.698e+02, threshold=6.538e+02, percent-clipped=0.0 2023-06-24 22:55:33,465 INFO [train.py:996] (2/4) Epoch 7, batch 17250, loss[loss=0.264, simple_loss=0.335, pruned_loss=0.09646, over 21335.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3061, pruned_loss=0.07968, over 4294319.60 frames. ], batch size: 549, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:55:36,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1201302.0, ans=0.125 2023-06-24 22:55:48,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1201302.0, ans=0.1 2023-06-24 22:57:08,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1201542.0, ans=0.05 2023-06-24 22:57:24,164 INFO [train.py:996] (2/4) Epoch 7, batch 17300, loss[loss=0.2757, simple_loss=0.3521, pruned_loss=0.09965, over 21438.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3149, pruned_loss=0.08348, over 4288563.62 frames. ], batch size: 131, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 22:58:47,371 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.589e+02 3.154e+02 3.783e+02 4.784e+02 7.470e+02, threshold=7.566e+02, percent-clipped=5.0 2023-06-24 22:58:50,119 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 22:59:15,015 INFO [train.py:996] (2/4) Epoch 7, batch 17350, loss[loss=0.2159, simple_loss=0.3119, pruned_loss=0.06001, over 19835.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3147, pruned_loss=0.08237, over 4281850.26 frames. ], batch size: 702, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 23:00:40,622 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:01:04,955 INFO [train.py:996] (2/4) Epoch 7, batch 17400, loss[loss=0.213, simple_loss=0.2968, pruned_loss=0.06457, over 21825.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3108, pruned_loss=0.07944, over 4278547.19 frames. ], batch size: 316, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 23:01:05,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1202202.0, ans=0.0 2023-06-24 23:01:31,983 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.62 vs. 
limit=15.0 2023-06-24 23:02:09,632 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.67 vs. limit=10.0 2023-06-24 23:02:12,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1202322.0, ans=0.125 2023-06-24 23:02:16,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1202322.0, ans=0.125 2023-06-24 23:02:19,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1202382.0, ans=0.015 2023-06-24 23:02:28,356 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 3.058e+02 3.682e+02 4.915e+02 8.567e+02, threshold=7.364e+02, percent-clipped=2.0 2023-06-24 23:02:36,656 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.31 vs. limit=15.0 2023-06-24 23:03:05,917 INFO [train.py:996] (2/4) Epoch 7, batch 17450, loss[loss=0.1992, simple_loss=0.2996, pruned_loss=0.04938, over 21583.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3044, pruned_loss=0.07656, over 4271788.06 frames. ], batch size: 389, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 23:03:59,762 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.45 vs. limit=10.0 2023-06-24 23:04:24,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1202742.0, ans=0.125 2023-06-24 23:04:27,106 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=12.0 2023-06-24 23:04:42,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1202742.0, ans=0.025 2023-06-24 23:04:59,051 INFO [train.py:996] (2/4) Epoch 7, batch 17500, loss[loss=0.2299, simple_loss=0.2983, pruned_loss=0.08072, over 21319.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3004, pruned_loss=0.07419, over 4273461.51 frames. ], batch size: 159, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 23:05:09,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1202802.0, ans=0.125 2023-06-24 23:05:43,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1202922.0, ans=0.125 2023-06-24 23:06:04,514 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 2.853e+02 3.403e+02 4.672e+02 8.323e+02, threshold=6.806e+02, percent-clipped=1.0 2023-06-24 23:06:44,182 INFO [train.py:996] (2/4) Epoch 7, batch 17550, loss[loss=0.2359, simple_loss=0.3194, pruned_loss=0.07624, over 21282.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3011, pruned_loss=0.07323, over 4268466.57 frames. 
], batch size: 143, lr: 4.28e-03, grad_scale: 8.0 2023-06-24 23:07:00,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1203162.0, ans=0.09899494936611666 2023-06-24 23:07:23,077 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:07:31,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1203222.0, ans=0.0 2023-06-24 23:07:42,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1203282.0, ans=0.5 2023-06-24 23:08:32,341 INFO [train.py:996] (2/4) Epoch 7, batch 17600, loss[loss=0.2602, simple_loss=0.3368, pruned_loss=0.09175, over 21187.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3034, pruned_loss=0.07321, over 4259930.23 frames. ], batch size: 143, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:08:47,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1203402.0, ans=10.0 2023-06-24 23:08:56,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1203462.0, ans=0.125 2023-06-24 23:09:41,271 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.027e+02 2.778e+02 3.294e+02 4.134e+02 8.304e+02, threshold=6.589e+02, percent-clipped=2.0 2023-06-24 23:09:54,251 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.53 vs. limit=15.0 2023-06-24 23:09:59,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1203642.0, ans=0.1 2023-06-24 23:10:12,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1203642.0, ans=0.125 2023-06-24 23:10:19,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1203702.0, ans=0.125 2023-06-24 23:10:20,648 INFO [train.py:996] (2/4) Epoch 7, batch 17650, loss[loss=0.2514, simple_loss=0.3206, pruned_loss=0.09111, over 21533.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.302, pruned_loss=0.07352, over 4250862.70 frames. ], batch size: 509, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:10:26,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1203702.0, ans=0.2 2023-06-24 23:10:33,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1203702.0, ans=0.2 2023-06-24 23:10:56,818 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.52 vs. limit=22.5 2023-06-24 23:11:08,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1203822.0, ans=0.0 2023-06-24 23:12:12,067 INFO [train.py:996] (2/4) Epoch 7, batch 17700, loss[loss=0.2616, simple_loss=0.3463, pruned_loss=0.08841, over 21574.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.295, pruned_loss=0.07022, over 4247166.29 frames. 
], batch size: 414, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:12:12,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1204002.0, ans=0.2 2023-06-24 23:12:39,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1204062.0, ans=0.125 2023-06-24 23:12:54,270 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=15.0 2023-06-24 23:13:23,986 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.59 vs. limit=15.0 2023-06-24 23:13:30,882 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 2.965e+02 3.854e+02 5.323e+02 9.978e+02, threshold=7.709e+02, percent-clipped=16.0 2023-06-24 23:14:06,537 INFO [train.py:996] (2/4) Epoch 7, batch 17750, loss[loss=0.212, simple_loss=0.2897, pruned_loss=0.0671, over 19983.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3028, pruned_loss=0.0734, over 4254773.50 frames. ], batch size: 703, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:14:11,555 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.36 vs. limit=15.0 2023-06-24 23:14:16,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1204302.0, ans=0.1 2023-06-24 23:14:19,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1204302.0, ans=0.2 2023-06-24 23:14:27,166 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.87 vs. limit=15.0 2023-06-24 23:14:36,048 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.84 vs. limit=10.0 2023-06-24 23:15:06,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1204422.0, ans=0.2 2023-06-24 23:15:56,644 INFO [train.py:996] (2/4) Epoch 7, batch 17800, loss[loss=0.1882, simple_loss=0.2628, pruned_loss=0.05679, over 21278.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3032, pruned_loss=0.07291, over 4262985.88 frames. ], batch size: 159, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:16:30,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1204662.0, ans=0.025 2023-06-24 23:17:23,013 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 2.835e+02 3.431e+02 4.472e+02 1.183e+03, threshold=6.863e+02, percent-clipped=3.0 2023-06-24 23:17:47,982 INFO [train.py:996] (2/4) Epoch 7, batch 17850, loss[loss=0.2611, simple_loss=0.3669, pruned_loss=0.07762, over 20704.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3033, pruned_loss=0.07284, over 4265402.08 frames. ], batch size: 607, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:18:10,846 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. 
limit=6.0 2023-06-24 23:18:39,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1205022.0, ans=0.125 2023-06-24 23:19:38,050 INFO [train.py:996] (2/4) Epoch 7, batch 17900, loss[loss=0.2278, simple_loss=0.3173, pruned_loss=0.06911, over 21435.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3078, pruned_loss=0.07383, over 4261455.29 frames. ], batch size: 194, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:19:45,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1205202.0, ans=0.125 2023-06-24 23:20:15,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1205262.0, ans=0.2 2023-06-24 23:20:24,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1205262.0, ans=0.125 2023-06-24 23:20:27,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1205322.0, ans=0.125 2023-06-24 23:20:44,013 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=22.5 2023-06-24 23:21:00,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1205382.0, ans=10.0 2023-06-24 23:21:03,387 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.271e+02 2.987e+02 3.415e+02 4.264e+02 7.391e+02, threshold=6.831e+02, percent-clipped=3.0 2023-06-24 23:21:21,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1205442.0, ans=0.2 2023-06-24 23:21:23,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1205442.0, ans=0.125 2023-06-24 23:21:26,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1205502.0, ans=0.0 2023-06-24 23:21:27,871 INFO [train.py:996] (2/4) Epoch 7, batch 17950, loss[loss=0.2023, simple_loss=0.2958, pruned_loss=0.05442, over 21760.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3059, pruned_loss=0.07065, over 4257436.88 frames. ], batch size: 332, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:22:20,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1205562.0, ans=0.125 2023-06-24 23:22:26,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1205622.0, ans=0.125 2023-06-24 23:23:03,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1205742.0, ans=0.0 2023-06-24 23:23:19,962 INFO [train.py:996] (2/4) Epoch 7, batch 18000, loss[loss=0.1859, simple_loss=0.2432, pruned_loss=0.06428, over 21120.00 frames. ], tot_loss[loss=0.22, simple_loss=0.3001, pruned_loss=0.06996, over 4258052.12 frames. ], batch size: 548, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:23:19,963 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 23:23:40,289 INFO [train.py:1028] (2/4) Epoch 7, validation: loss=0.2616, simple_loss=0.3599, pruned_loss=0.08162, over 1796401.00 frames. 
2023-06-24 23:23:40,290 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-24 23:24:10,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1205862.0, ans=0.125 2023-06-24 23:24:18,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1205862.0, ans=0.125 2023-06-24 23:24:48,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1205982.0, ans=0.2 2023-06-24 23:24:48,336 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:24:55,040 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.910e+02 2.947e+02 3.493e+02 4.464e+02 9.866e+02, threshold=6.986e+02, percent-clipped=5.0 2023-06-24 23:25:13,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1206042.0, ans=0.125 2023-06-24 23:25:32,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1206042.0, ans=0.0 2023-06-24 23:25:32,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1206042.0, ans=0.125 2023-06-24 23:25:35,642 INFO [train.py:996] (2/4) Epoch 7, batch 18050, loss[loss=0.2237, simple_loss=0.2855, pruned_loss=0.08096, over 21292.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2955, pruned_loss=0.06924, over 4265126.16 frames. ], batch size: 471, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:26:05,375 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.20 vs. limit=22.5 2023-06-24 23:26:12,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1206162.0, ans=0.1 2023-06-24 23:26:40,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1206282.0, ans=0.1 2023-06-24 23:26:44,854 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.08 vs. limit=22.5 2023-06-24 23:27:32,135 INFO [train.py:996] (2/4) Epoch 7, batch 18100, loss[loss=0.2346, simple_loss=0.3232, pruned_loss=0.073, over 21251.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3005, pruned_loss=0.07186, over 4261145.32 frames. ], batch size: 143, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:27:37,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1206402.0, ans=0.125 2023-06-24 23:28:44,815 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 2.881e+02 3.345e+02 4.009e+02 7.924e+02, threshold=6.690e+02, percent-clipped=2.0 2023-06-24 23:28:47,797 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.89 vs. 
limit=15.0 2023-06-24 23:28:55,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1206642.0, ans=0.0 2023-06-24 23:29:13,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1206702.0, ans=0.1 2023-06-24 23:29:14,115 INFO [train.py:996] (2/4) Epoch 7, batch 18150, loss[loss=0.2002, simple_loss=0.2718, pruned_loss=0.06427, over 21458.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3018, pruned_loss=0.07169, over 4254357.57 frames. ], batch size: 212, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:29:35,704 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.38 vs. limit=15.0 2023-06-24 23:29:36,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1206762.0, ans=0.125 2023-06-24 23:30:02,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1206822.0, ans=0.025 2023-06-24 23:30:05,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1206822.0, ans=0.05 2023-06-24 23:30:09,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1206882.0, ans=0.125 2023-06-24 23:30:21,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1206882.0, ans=0.2 2023-06-24 23:30:43,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1206942.0, ans=0.125 2023-06-24 23:30:56,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1206942.0, ans=0.125 2023-06-24 23:30:59,298 INFO [train.py:996] (2/4) Epoch 7, batch 18200, loss[loss=0.1982, simple_loss=0.2696, pruned_loss=0.06339, over 21467.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2964, pruned_loss=0.0714, over 4245199.73 frames. ], batch size: 211, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:31:31,780 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:31:41,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1207122.0, ans=0.125 2023-06-24 23:31:48,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1207122.0, ans=0.0 2023-06-24 23:31:51,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1207122.0, ans=10.0 2023-06-24 23:32:04,652 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.52 vs. 
limit=10.0 2023-06-24 23:32:05,011 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.874e+02 3.635e+02 5.188e+02 1.150e+03, threshold=7.270e+02, percent-clipped=9.0 2023-06-24 23:32:10,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1207242.0, ans=0.0 2023-06-24 23:32:20,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1207242.0, ans=0.2 2023-06-24 23:32:38,604 INFO [train.py:996] (2/4) Epoch 7, batch 18250, loss[loss=0.2448, simple_loss=0.325, pruned_loss=0.08232, over 19901.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.29, pruned_loss=0.06991, over 4246515.82 frames. ], batch size: 702, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:33:19,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1207362.0, ans=0.0 2023-06-24 23:33:40,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1207482.0, ans=0.0 2023-06-24 23:34:24,258 INFO [train.py:996] (2/4) Epoch 7, batch 18300, loss[loss=0.2659, simple_loss=0.3657, pruned_loss=0.08309, over 21673.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2889, pruned_loss=0.06937, over 4250823.98 frames. ], batch size: 389, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:34:30,492 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.89 vs. limit=15.0 2023-06-24 23:34:43,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1207602.0, ans=15.0 2023-06-24 23:34:46,905 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.04 vs. limit=22.5 2023-06-24 23:35:39,694 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 2.910e+02 3.541e+02 4.206e+02 1.059e+03, threshold=7.082e+02, percent-clipped=3.0 2023-06-24 23:35:40,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1207782.0, ans=0.125 2023-06-24 23:35:44,762 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=12.0 2023-06-24 23:35:52,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1207842.0, ans=0.2 2023-06-24 23:35:56,217 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.24 vs. limit=8.0 2023-06-24 23:35:57,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1207842.0, ans=0.1 2023-06-24 23:36:12,377 INFO [train.py:996] (2/4) Epoch 7, batch 18350, loss[loss=0.1679, simple_loss=0.244, pruned_loss=0.04591, over 17129.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2955, pruned_loss=0.06966, over 4254004.76 frames. 
], batch size: 65, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:36:31,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1207902.0, ans=0.125 2023-06-24 23:36:42,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1207962.0, ans=0.2 2023-06-24 23:37:58,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1208142.0, ans=0.07 2023-06-24 23:38:01,050 INFO [train.py:996] (2/4) Epoch 7, batch 18400, loss[loss=0.1599, simple_loss=0.2535, pruned_loss=0.03314, over 21751.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2907, pruned_loss=0.06844, over 4253138.69 frames. ], batch size: 333, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:38:14,661 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:38:36,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1208262.0, ans=0.125 2023-06-24 23:38:37,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1208262.0, ans=0.125 2023-06-24 23:38:46,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1208322.0, ans=0.125 2023-06-24 23:39:16,996 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.081e+02 2.559e+02 3.009e+02 3.655e+02 5.951e+02, threshold=6.019e+02, percent-clipped=0.0 2023-06-24 23:39:49,319 INFO [train.py:996] (2/4) Epoch 7, batch 18450, loss[loss=0.1839, simple_loss=0.2731, pruned_loss=0.04733, over 21717.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.287, pruned_loss=0.06473, over 4249238.74 frames. ], batch size: 298, lr: 4.27e-03, grad_scale: 32.0 2023-06-24 23:40:33,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1208622.0, ans=0.125 2023-06-24 23:40:44,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1208622.0, ans=0.0 2023-06-24 23:40:58,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1208682.0, ans=0.1 2023-06-24 23:41:28,459 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=15.0 2023-06-24 23:41:35,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1208742.0, ans=0.2 2023-06-24 23:41:38,254 INFO [train.py:996] (2/4) Epoch 7, batch 18500, loss[loss=0.199, simple_loss=0.2645, pruned_loss=0.06679, over 21904.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.283, pruned_loss=0.06399, over 4251675.95 frames. ], batch size: 98, lr: 4.27e-03, grad_scale: 32.0 2023-06-24 23:41:40,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1208802.0, ans=0.1 2023-06-24 23:41:42,663 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.63 vs. 
limit=10.0 2023-06-24 23:42:59,446 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.887e+02 2.868e+02 3.588e+02 5.410e+02 1.340e+03, threshold=7.175e+02, percent-clipped=18.0 2023-06-24 23:43:00,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1208982.0, ans=0.05 2023-06-24 23:43:02,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1209042.0, ans=0.125 2023-06-24 23:43:21,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1209042.0, ans=0.1 2023-06-24 23:43:25,443 INFO [train.py:996] (2/4) Epoch 7, batch 18550, loss[loss=0.1956, simple_loss=0.2768, pruned_loss=0.05723, over 21679.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2826, pruned_loss=0.06365, over 4250994.00 frames. ], batch size: 298, lr: 4.27e-03, grad_scale: 16.0 2023-06-24 23:45:13,379 INFO [train.py:996] (2/4) Epoch 7, batch 18600, loss[loss=0.2072, simple_loss=0.2679, pruned_loss=0.07323, over 21202.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2798, pruned_loss=0.06391, over 4243045.47 frames. ], batch size: 176, lr: 4.27e-03, grad_scale: 16.0 2023-06-24 23:45:15,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1209402.0, ans=0.0 2023-06-24 23:45:28,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1209402.0, ans=0.0 2023-06-24 23:45:38,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1209462.0, ans=0.0 2023-06-24 23:46:35,140 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.703e+02 3.435e+02 4.233e+02 7.811e+02, threshold=6.869e+02, percent-clipped=3.0 2023-06-24 23:47:01,155 INFO [train.py:996] (2/4) Epoch 7, batch 18650, loss[loss=0.2303, simple_loss=0.2895, pruned_loss=0.08557, over 21879.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2798, pruned_loss=0.06413, over 4248539.73 frames. ], batch size: 107, lr: 4.27e-03, grad_scale: 16.0 2023-06-24 23:48:16,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1209882.0, ans=0.0 2023-06-24 23:48:32,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1209942.0, ans=0.0 2023-06-24 23:48:43,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1209942.0, ans=0.2 2023-06-24 23:48:48,674 INFO [train.py:996] (2/4) Epoch 7, batch 18700, loss[loss=0.2056, simple_loss=0.2688, pruned_loss=0.07123, over 21603.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2775, pruned_loss=0.06535, over 4260905.72 frames. 
], batch size: 263, lr: 4.27e-03, grad_scale: 16.0 2023-06-24 23:49:27,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1210062.0, ans=0.125 2023-06-24 23:49:34,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1210122.0, ans=0.1 2023-06-24 23:49:41,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1210122.0, ans=0.0 2023-06-24 23:50:10,825 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.160e+02 2.803e+02 3.350e+02 3.905e+02 5.845e+02, threshold=6.700e+02, percent-clipped=0.0 2023-06-24 23:50:36,541 INFO [train.py:996] (2/4) Epoch 7, batch 18750, loss[loss=0.2212, simple_loss=0.2789, pruned_loss=0.0818, over 21595.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2795, pruned_loss=0.06787, over 4272420.02 frames. ], batch size: 548, lr: 4.27e-03, grad_scale: 16.0 2023-06-24 23:52:11,229 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.94 vs. limit=22.5 2023-06-24 23:52:18,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1210542.0, ans=0.1 2023-06-24 23:52:22,899 INFO [train.py:996] (2/4) Epoch 7, batch 18800, loss[loss=0.2219, simple_loss=0.3237, pruned_loss=0.06007, over 20821.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.285, pruned_loss=0.06891, over 4273664.33 frames. ], batch size: 608, lr: 4.27e-03, grad_scale: 32.0 2023-06-24 23:52:25,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1210602.0, ans=0.0 2023-06-24 23:52:35,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1210602.0, ans=0.125 2023-06-24 23:52:46,686 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.87 vs. limit=15.0 2023-06-24 23:52:52,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1210662.0, ans=0.1 2023-06-24 23:53:08,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1210722.0, ans=0.2 2023-06-24 23:53:43,563 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.736e+02 2.643e+02 3.373e+02 4.457e+02 8.790e+02, threshold=6.746e+02, percent-clipped=4.0 2023-06-24 23:53:49,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1210842.0, ans=0.2 2023-06-24 23:54:09,227 INFO [train.py:996] (2/4) Epoch 7, batch 18850, loss[loss=0.1808, simple_loss=0.2515, pruned_loss=0.05505, over 21524.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2816, pruned_loss=0.06508, over 4263908.41 frames. ], batch size: 195, lr: 4.27e-03, grad_scale: 32.0 2023-06-24 23:54:35,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1210962.0, ans=0.125 2023-06-24 23:55:19,764 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=19.21 vs. 
limit=22.5 2023-06-24 23:55:22,820 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-24 23:55:47,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1211142.0, ans=0.125 2023-06-24 23:55:56,255 INFO [train.py:996] (2/4) Epoch 7, batch 18900, loss[loss=0.1829, simple_loss=0.2465, pruned_loss=0.05963, over 21542.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2792, pruned_loss=0.06512, over 4265045.54 frames. ], batch size: 231, lr: 4.27e-03, grad_scale: 32.0 2023-06-24 23:57:17,723 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.759e+02 3.206e+02 4.379e+02 8.069e+02, threshold=6.411e+02, percent-clipped=2.0 2023-06-24 23:57:39,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1211442.0, ans=0.125 2023-06-24 23:57:39,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1211442.0, ans=0.2 2023-06-24 23:57:44,063 INFO [train.py:996] (2/4) Epoch 7, batch 18950, loss[loss=0.2319, simple_loss=0.3039, pruned_loss=0.07994, over 21821.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2811, pruned_loss=0.06766, over 4277906.78 frames. ], batch size: 124, lr: 4.27e-03, grad_scale: 16.0 2023-06-24 23:58:04,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1211502.0, ans=0.0 2023-06-24 23:58:28,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1211622.0, ans=0.125 2023-06-24 23:58:41,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1211622.0, ans=0.0 2023-06-24 23:59:36,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1211742.0, ans=0.0 2023-06-24 23:59:39,058 INFO [train.py:996] (2/4) Epoch 7, batch 19000, loss[loss=0.2564, simple_loss=0.3396, pruned_loss=0.08663, over 21880.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2909, pruned_loss=0.06931, over 4281067.46 frames. ], batch size: 372, lr: 4.27e-03, grad_scale: 16.0 2023-06-25 00:00:16,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1211862.0, ans=0.2 2023-06-25 00:01:02,158 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.437e+02 3.130e+02 3.898e+02 4.619e+02 8.945e+02, threshold=7.797e+02, percent-clipped=5.0 2023-06-25 00:01:26,895 INFO [train.py:996] (2/4) Epoch 7, batch 19050, loss[loss=0.2099, simple_loss=0.2815, pruned_loss=0.06911, over 21660.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2978, pruned_loss=0.07364, over 4288838.56 frames. ], batch size: 263, lr: 4.27e-03, grad_scale: 16.0 2023-06-25 00:01:29,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1212102.0, ans=0.0 2023-06-25 00:02:21,289 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.08 vs. 
limit=15.0 2023-06-25 00:03:13,223 INFO [train.py:996] (2/4) Epoch 7, batch 19100, loss[loss=0.2278, simple_loss=0.2882, pruned_loss=0.0837, over 21486.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2954, pruned_loss=0.07438, over 4291612.38 frames. ], batch size: 548, lr: 4.27e-03, grad_scale: 16.0 2023-06-25 00:03:35,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1212462.0, ans=0.125 2023-06-25 00:04:34,157 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:04:38,991 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.215e+02 2.814e+02 3.416e+02 4.391e+02 9.529e+02, threshold=6.832e+02, percent-clipped=4.0 2023-06-25 00:04:50,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1212642.0, ans=0.2 2023-06-25 00:04:52,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1212642.0, ans=0.125 2023-06-25 00:05:04,682 INFO [train.py:996] (2/4) Epoch 7, batch 19150, loss[loss=0.2191, simple_loss=0.3052, pruned_loss=0.06653, over 21412.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2962, pruned_loss=0.07459, over 4282521.02 frames. ], batch size: 194, lr: 4.27e-03, grad_scale: 16.0 2023-06-25 00:05:30,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1212762.0, ans=0.0 2023-06-25 00:05:48,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1212762.0, ans=0.125 2023-06-25 00:06:45,983 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.24 vs. limit=15.0 2023-06-25 00:06:50,478 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:07:00,465 INFO [train.py:996] (2/4) Epoch 7, batch 19200, loss[loss=0.2725, simple_loss=0.3952, pruned_loss=0.07492, over 20728.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3079, pruned_loss=0.07645, over 4281935.43 frames. ], batch size: 607, lr: 4.27e-03, grad_scale: 32.0 2023-06-25 00:07:24,753 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.56 vs. limit=15.0 2023-06-25 00:07:49,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1213122.0, ans=0.125 2023-06-25 00:08:23,639 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.016e+02 3.201e+02 4.532e+02 8.099e+02 1.362e+03, threshold=9.063e+02, percent-clipped=31.0 2023-06-25 00:08:48,666 INFO [train.py:996] (2/4) Epoch 7, batch 19250, loss[loss=0.1684, simple_loss=0.2657, pruned_loss=0.03557, over 21726.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3078, pruned_loss=0.07166, over 4279583.16 frames. 
], batch size: 298, lr: 4.27e-03, grad_scale: 32.0 2023-06-25 00:08:54,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1213302.0, ans=0.0 2023-06-25 00:09:18,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1213362.0, ans=0.0 2023-06-25 00:09:37,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1213422.0, ans=0.0 2023-06-25 00:09:51,613 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.87 vs. limit=10.0 2023-06-25 00:10:29,818 INFO [train.py:996] (2/4) Epoch 7, batch 19300, loss[loss=0.2016, simple_loss=0.2734, pruned_loss=0.06493, over 21471.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3039, pruned_loss=0.06982, over 4282680.99 frames. ], batch size: 194, lr: 4.27e-03, grad_scale: 16.0 2023-06-25 00:11:26,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1213722.0, ans=0.0 2023-06-25 00:12:02,040 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.816e+02 2.613e+02 3.067e+02 3.986e+02 9.865e+02, threshold=6.134e+02, percent-clipped=1.0 2023-06-25 00:12:07,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1213842.0, ans=0.0 2023-06-25 00:12:24,977 INFO [train.py:996] (2/4) Epoch 7, batch 19350, loss[loss=0.1771, simple_loss=0.257, pruned_loss=0.04865, over 21434.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2971, pruned_loss=0.06618, over 4277388.12 frames. ], batch size: 195, lr: 4.27e-03, grad_scale: 16.0 2023-06-25 00:12:27,335 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:12:40,471 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-25 00:12:49,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1213962.0, ans=0.125 2023-06-25 00:13:17,018 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=15.0 2023-06-25 00:13:39,110 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.27 vs. limit=15.0 2023-06-25 00:14:08,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1214142.0, ans=0.05 2023-06-25 00:14:08,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1214142.0, ans=0.0 2023-06-25 00:14:11,258 INFO [train.py:996] (2/4) Epoch 7, batch 19400, loss[loss=0.1876, simple_loss=0.2587, pruned_loss=0.05829, over 21207.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2937, pruned_loss=0.06519, over 4275445.49 frames. ], batch size: 143, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:14:21,153 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.17 vs. 
limit=10.0 2023-06-25 00:14:28,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1214202.0, ans=0.05 2023-06-25 00:14:32,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1214262.0, ans=0.125 2023-06-25 00:15:17,207 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.33 vs. limit=12.0 2023-06-25 00:15:34,736 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 2.871e+02 3.427e+02 4.239e+02 8.208e+02, threshold=6.853e+02, percent-clipped=6.0 2023-06-25 00:15:42,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1214442.0, ans=0.125 2023-06-25 00:15:58,299 INFO [train.py:996] (2/4) Epoch 7, batch 19450, loss[loss=0.237, simple_loss=0.301, pruned_loss=0.08649, over 21760.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2913, pruned_loss=0.06716, over 4279786.70 frames. ], batch size: 102, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:16:45,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1214622.0, ans=0.125 2023-06-25 00:16:47,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1214622.0, ans=0.125 2023-06-25 00:17:15,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1214682.0, ans=0.125 2023-06-25 00:17:46,961 INFO [train.py:996] (2/4) Epoch 7, batch 19500, loss[loss=0.1854, simple_loss=0.2355, pruned_loss=0.06765, over 20832.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2873, pruned_loss=0.06844, over 4274631.23 frames. ], batch size: 608, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:18:25,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1214862.0, ans=0.1 2023-06-25 00:18:47,469 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=15.0 2023-06-25 00:19:11,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1214982.0, ans=0.125 2023-06-25 00:19:13,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1215042.0, ans=0.1 2023-06-25 00:19:14,511 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.308e+02 2.919e+02 3.343e+02 4.176e+02 7.589e+02, threshold=6.686e+02, percent-clipped=2.0 2023-06-25 00:19:34,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1215042.0, ans=0.1 2023-06-25 00:19:36,582 INFO [train.py:996] (2/4) Epoch 7, batch 19550, loss[loss=0.1645, simple_loss=0.2284, pruned_loss=0.05029, over 21232.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2852, pruned_loss=0.06808, over 4277684.61 frames. 
], batch size: 159, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:19:37,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1215102.0, ans=0.125 2023-06-25 00:19:50,103 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.92 vs. limit=6.0 2023-06-25 00:19:56,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1215102.0, ans=0.0 2023-06-25 00:20:01,974 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.88 vs. limit=15.0 2023-06-25 00:20:18,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1215162.0, ans=0.0 2023-06-25 00:20:31,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1215222.0, ans=0.125 2023-06-25 00:20:32,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1215222.0, ans=0.125 2023-06-25 00:20:57,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1215282.0, ans=0.125 2023-06-25 00:21:26,528 INFO [train.py:996] (2/4) Epoch 7, batch 19600, loss[loss=0.2444, simple_loss=0.3111, pruned_loss=0.08883, over 21741.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2877, pruned_loss=0.06916, over 4279795.41 frames. ], batch size: 389, lr: 4.26e-03, grad_scale: 32.0 2023-06-25 00:22:05,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1215462.0, ans=0.1 2023-06-25 00:22:08,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1215462.0, ans=0.125 2023-06-25 00:22:46,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1215582.0, ans=0.125 2023-06-25 00:22:52,380 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.425e+02 3.092e+02 3.648e+02 4.642e+02 7.608e+02, threshold=7.295e+02, percent-clipped=3.0 2023-06-25 00:23:21,465 INFO [train.py:996] (2/4) Epoch 7, batch 19650, loss[loss=0.2427, simple_loss=0.3103, pruned_loss=0.08752, over 21341.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2924, pruned_loss=0.0718, over 4279128.97 frames. ], batch size: 548, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:23:52,631 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.56 vs. limit=15.0 2023-06-25 00:24:15,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1215822.0, ans=0.125 2023-06-25 00:25:00,901 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.58 vs. limit=22.5 2023-06-25 00:25:19,712 INFO [train.py:996] (2/4) Epoch 7, batch 19700, loss[loss=0.2168, simple_loss=0.3131, pruned_loss=0.06019, over 21655.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2966, pruned_loss=0.07315, over 4281659.33 frames. 
], batch size: 414, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:25:33,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1216002.0, ans=0.125 2023-06-25 00:26:15,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1216122.0, ans=10.0 2023-06-25 00:26:20,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1216122.0, ans=0.1 2023-06-25 00:26:40,284 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.40 vs. limit=15.0 2023-06-25 00:26:54,001 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.233e+02 3.060e+02 3.533e+02 4.552e+02 9.773e+02, threshold=7.066e+02, percent-clipped=3.0 2023-06-25 00:27:14,660 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=15.0 2023-06-25 00:27:15,071 INFO [train.py:996] (2/4) Epoch 7, batch 19750, loss[loss=0.2541, simple_loss=0.3445, pruned_loss=0.08184, over 21775.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.306, pruned_loss=0.07387, over 4285249.23 frames. ], batch size: 332, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:27:15,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1216302.0, ans=0.0 2023-06-25 00:27:34,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1216362.0, ans=0.0 2023-06-25 00:27:52,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1216422.0, ans=0.125 2023-06-25 00:28:40,632 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.10 vs. limit=15.0 2023-06-25 00:29:02,173 INFO [train.py:996] (2/4) Epoch 7, batch 19800, loss[loss=0.2319, simple_loss=0.3157, pruned_loss=0.07407, over 21538.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3057, pruned_loss=0.07444, over 4288035.66 frames. ], batch size: 471, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:30:30,871 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.745e+02 3.353e+02 4.359e+02 1.129e+03, threshold=6.706e+02, percent-clipped=10.0 2023-06-25 00:30:49,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1216842.0, ans=10.0 2023-06-25 00:30:52,388 INFO [train.py:996] (2/4) Epoch 7, batch 19850, loss[loss=0.1492, simple_loss=0.2233, pruned_loss=0.03755, over 21788.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2967, pruned_loss=0.06952, over 4291350.07 frames. ], batch size: 124, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:31:06,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1216902.0, ans=0.2 2023-06-25 00:32:02,273 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:32:27,448 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.45 vs. 
limit=15.0 2023-06-25 00:32:39,683 INFO [train.py:996] (2/4) Epoch 7, batch 19900, loss[loss=0.1745, simple_loss=0.2504, pruned_loss=0.04931, over 21840.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2963, pruned_loss=0.06746, over 4283495.68 frames. ], batch size: 118, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:32:46,161 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.78 vs. limit=15.0 2023-06-25 00:34:12,744 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 2.818e+02 3.439e+02 4.122e+02 9.461e+02, threshold=6.879e+02, percent-clipped=3.0 2023-06-25 00:34:28,696 INFO [train.py:996] (2/4) Epoch 7, batch 19950, loss[loss=0.1847, simple_loss=0.2573, pruned_loss=0.05603, over 21618.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.29, pruned_loss=0.06732, over 4284310.89 frames. ], batch size: 298, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:34:37,280 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=22.5 2023-06-25 00:34:56,634 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.84 vs. limit=15.0 2023-06-25 00:34:58,962 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=18.07 vs. limit=22.5 2023-06-25 00:35:24,602 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=12.0 2023-06-25 00:35:38,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1217682.0, ans=0.125 2023-06-25 00:35:39,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1217682.0, ans=0.125 2023-06-25 00:35:40,343 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.80 vs. limit=15.0 2023-06-25 00:36:17,069 INFO [train.py:996] (2/4) Epoch 7, batch 20000, loss[loss=0.2171, simple_loss=0.3008, pruned_loss=0.06665, over 21739.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2913, pruned_loss=0.06732, over 4285681.67 frames. ], batch size: 282, lr: 4.26e-03, grad_scale: 32.0 2023-06-25 00:36:26,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1217802.0, ans=0.0 2023-06-25 00:37:14,984 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-25 00:37:44,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1218042.0, ans=0.1 2023-06-25 00:37:47,461 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 2.923e+02 3.292e+02 4.012e+02 7.608e+02, threshold=6.584e+02, percent-clipped=1.0 2023-06-25 00:38:03,224 INFO [train.py:996] (2/4) Epoch 7, batch 20050, loss[loss=0.2183, simple_loss=0.2951, pruned_loss=0.07077, over 21930.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2926, pruned_loss=0.06923, over 4283127.31 frames. 
], batch size: 316, lr: 4.26e-03, grad_scale: 32.0 2023-06-25 00:38:35,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1218162.0, ans=0.125 2023-06-25 00:38:51,747 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.51 vs. limit=15.0 2023-06-25 00:39:29,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1218282.0, ans=0.1 2023-06-25 00:39:53,779 INFO [train.py:996] (2/4) Epoch 7, batch 20100, loss[loss=0.2185, simple_loss=0.2911, pruned_loss=0.07296, over 21431.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2957, pruned_loss=0.07152, over 4290072.15 frames. ], batch size: 211, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:40:45,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1218522.0, ans=0.0 2023-06-25 00:41:15,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1218582.0, ans=0.125 2023-06-25 00:41:29,512 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.560e+02 2.968e+02 3.649e+02 4.781e+02 8.701e+02, threshold=7.299e+02, percent-clipped=5.0 2023-06-25 00:41:49,534 INFO [train.py:996] (2/4) Epoch 7, batch 20150, loss[loss=0.2296, simple_loss=0.3007, pruned_loss=0.0793, over 21681.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3064, pruned_loss=0.07488, over 4290861.90 frames. ], batch size: 263, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:42:40,923 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.38 vs. limit=15.0 2023-06-25 00:43:51,478 INFO [train.py:996] (2/4) Epoch 7, batch 20200, loss[loss=0.2949, simple_loss=0.3856, pruned_loss=0.102, over 21531.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3144, pruned_loss=0.07886, over 4291922.43 frames. ], batch size: 471, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:43:59,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1219002.0, ans=0.0 2023-06-25 00:44:21,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1219062.0, ans=0.125 2023-06-25 00:45:05,420 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.45 vs. limit=15.0 2023-06-25 00:45:22,360 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.473e+02 3.331e+02 3.947e+02 5.099e+02 9.386e+02, threshold=7.894e+02, percent-clipped=7.0 2023-06-25 00:45:36,261 INFO [train.py:996] (2/4) Epoch 7, batch 20250, loss[loss=0.2517, simple_loss=0.3295, pruned_loss=0.08694, over 21547.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3154, pruned_loss=0.07799, over 4290291.66 frames. ], batch size: 471, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:47:11,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1219542.0, ans=0.2 2023-06-25 00:47:25,053 INFO [train.py:996] (2/4) Epoch 7, batch 20300, loss[loss=0.2264, simple_loss=0.3244, pruned_loss=0.06421, over 21258.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3137, pruned_loss=0.07555, over 4286974.81 frames. 
], batch size: 548, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:48:34,597 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.48 vs. limit=10.0 2023-06-25 00:48:52,970 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.137e+02 2.615e+02 3.044e+02 3.787e+02 8.411e+02, threshold=6.088e+02, percent-clipped=1.0 2023-06-25 00:49:11,893 INFO [train.py:996] (2/4) Epoch 7, batch 20350, loss[loss=0.2269, simple_loss=0.2992, pruned_loss=0.07729, over 20925.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3127, pruned_loss=0.07566, over 4276969.17 frames. ], batch size: 607, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 00:49:14,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1219902.0, ans=0.125 2023-06-25 00:49:30,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1219902.0, ans=0.125 2023-06-25 00:50:05,759 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=22.5 2023-06-25 00:50:09,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1220082.0, ans=0.0 2023-06-25 00:50:53,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1220142.0, ans=0.1 2023-06-25 00:50:56,354 INFO [train.py:996] (2/4) Epoch 7, batch 20400, loss[loss=0.288, simple_loss=0.3624, pruned_loss=0.1068, over 21647.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3134, pruned_loss=0.07729, over 4263255.11 frames. ], batch size: 414, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 00:51:22,226 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:51:54,505 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:52:32,836 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.319e+02 3.347e+02 3.963e+02 4.819e+02 8.468e+02, threshold=7.927e+02, percent-clipped=6.0 2023-06-25 00:52:44,836 INFO [train.py:996] (2/4) Epoch 7, batch 20450, loss[loss=0.2056, simple_loss=0.2778, pruned_loss=0.06673, over 21930.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3149, pruned_loss=0.08021, over 4261891.73 frames. 
], batch size: 316, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 00:52:45,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1220502.0, ans=0.2 2023-06-25 00:52:58,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1220502.0, ans=0.07 2023-06-25 00:53:25,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1220622.0, ans=0.125 2023-06-25 00:53:28,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1220622.0, ans=0.0 2023-06-25 00:53:39,205 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:53:54,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1220682.0, ans=0.0 2023-06-25 00:54:08,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1220742.0, ans=0.5 2023-06-25 00:54:25,825 INFO [train.py:996] (2/4) Epoch 7, batch 20500, loss[loss=0.2552, simple_loss=0.2956, pruned_loss=0.1074, over 21564.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3097, pruned_loss=0.07994, over 4265404.33 frames. ], batch size: 508, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 00:54:26,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1220802.0, ans=0.125 2023-06-25 00:54:42,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1220802.0, ans=0.015 2023-06-25 00:56:00,861 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.204e+02 4.054e+02 5.426e+02 8.867e+02, threshold=8.109e+02, percent-clipped=2.0 2023-06-25 00:56:04,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1221042.0, ans=0.125 2023-06-25 00:56:13,076 INFO [train.py:996] (2/4) Epoch 7, batch 20550, loss[loss=0.2273, simple_loss=0.322, pruned_loss=0.0663, over 21615.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3028, pruned_loss=0.07804, over 4257650.01 frames. ], batch size: 389, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 00:56:34,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1221162.0, ans=0.125 2023-06-25 00:56:44,656 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=15.0 2023-06-25 00:56:54,995 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-25 00:57:56,522 INFO [train.py:996] (2/4) Epoch 7, batch 20600, loss[loss=0.2177, simple_loss=0.3014, pruned_loss=0.06706, over 21229.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3024, pruned_loss=0.07569, over 4249532.85 frames. 
], batch size: 176, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 00:58:07,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1221402.0, ans=0.0 2023-06-25 00:58:13,180 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.49 vs. limit=15.0 2023-06-25 00:58:24,951 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=22.5 2023-06-25 00:58:29,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1221462.0, ans=0.125 2023-06-25 00:58:34,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1221522.0, ans=0.2 2023-06-25 00:58:49,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1221522.0, ans=0.125 2023-06-25 00:59:25,529 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 3.095e+02 3.828e+02 5.103e+02 1.106e+03, threshold=7.656e+02, percent-clipped=7.0 2023-06-25 00:59:34,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1221642.0, ans=0.05 2023-06-25 00:59:37,772 INFO [train.py:996] (2/4) Epoch 7, batch 20650, loss[loss=0.2185, simple_loss=0.29, pruned_loss=0.07351, over 21241.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2999, pruned_loss=0.07615, over 4249456.34 frames. ], batch size: 143, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 00:59:38,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1221702.0, ans=0.1 2023-06-25 00:59:48,933 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.19 vs. limit=6.0 2023-06-25 00:59:52,495 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.65 vs. limit=22.5 2023-06-25 00:59:57,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1221702.0, ans=0.125 2023-06-25 01:00:59,781 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.09 vs. limit=22.5 2023-06-25 01:01:06,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1221882.0, ans=0.125 2023-06-25 01:01:06,820 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-25 01:01:32,852 INFO [train.py:996] (2/4) Epoch 7, batch 20700, loss[loss=0.1537, simple_loss=0.2294, pruned_loss=0.03901, over 21326.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2926, pruned_loss=0.07269, over 4250099.01 frames. 
], batch size: 176, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:01:47,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1222002.0, ans=0.125 2023-06-25 01:01:49,497 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0 2023-06-25 01:02:05,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1222062.0, ans=0.0 2023-06-25 01:02:08,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1222122.0, ans=0.2 2023-06-25 01:02:18,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1222122.0, ans=0.125 2023-06-25 01:02:51,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1222182.0, ans=0.0 2023-06-25 01:02:54,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1222182.0, ans=0.1 2023-06-25 01:02:54,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1222182.0, ans=0.125 2023-06-25 01:03:06,670 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 2.936e+02 3.801e+02 5.565e+02 1.085e+03, threshold=7.602e+02, percent-clipped=14.0 2023-06-25 01:03:21,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1222242.0, ans=0.125 2023-06-25 01:03:23,606 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.84 vs. limit=10.0 2023-06-25 01:03:24,058 INFO [train.py:996] (2/4) Epoch 7, batch 20750, loss[loss=0.2202, simple_loss=0.3055, pruned_loss=0.06743, over 21212.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2923, pruned_loss=0.07176, over 4250745.21 frames. ], batch size: 159, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:04:13,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1222422.0, ans=0.1 2023-06-25 01:04:23,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1222422.0, ans=0.07 2023-06-25 01:05:07,447 INFO [train.py:996] (2/4) Epoch 7, batch 20800, loss[loss=0.2884, simple_loss=0.3642, pruned_loss=0.1063, over 21403.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2968, pruned_loss=0.07276, over 4259059.23 frames. 
], batch size: 471, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 01:05:08,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1222602.0, ans=0.0 2023-06-25 01:05:11,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1222602.0, ans=0.0 2023-06-25 01:05:15,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1222602.0, ans=0.125 2023-06-25 01:05:17,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1222602.0, ans=0.0 2023-06-25 01:05:26,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1222662.0, ans=0.1 2023-06-25 01:05:32,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1222662.0, ans=0.2 2023-06-25 01:06:07,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1222722.0, ans=0.1 2023-06-25 01:06:23,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1222782.0, ans=0.125 2023-06-25 01:06:25,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1222782.0, ans=0.1 2023-06-25 01:06:37,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1222842.0, ans=0.0 2023-06-25 01:06:38,177 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.83 vs. limit=15.0 2023-06-25 01:06:40,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1222842.0, ans=0.125 2023-06-25 01:06:43,833 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.363e+02 3.312e+02 4.339e+02 6.808e+02 1.439e+03, threshold=8.678e+02, percent-clipped=19.0 2023-06-25 01:06:55,814 INFO [train.py:996] (2/4) Epoch 7, batch 20850, loss[loss=0.2016, simple_loss=0.267, pruned_loss=0.06813, over 21136.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2912, pruned_loss=0.07095, over 4254608.44 frames. 
], batch size: 608, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 01:07:02,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1222902.0, ans=0.1 2023-06-25 01:07:04,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1222902.0, ans=0.2 2023-06-25 01:07:33,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1222962.0, ans=0.0 2023-06-25 01:08:05,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1223082.0, ans=0.1 2023-06-25 01:08:19,615 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:08:35,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1223142.0, ans=0.125 2023-06-25 01:08:36,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1223142.0, ans=0.0 2023-06-25 01:08:44,703 INFO [train.py:996] (2/4) Epoch 7, batch 20900, loss[loss=0.2466, simple_loss=0.309, pruned_loss=0.09211, over 21747.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.293, pruned_loss=0.0718, over 4256772.43 frames. ], batch size: 441, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 01:08:52,315 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.07 vs. limit=10.0 2023-06-25 01:09:05,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1223262.0, ans=0.0 2023-06-25 01:09:20,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1223262.0, ans=0.125 2023-06-25 01:10:03,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1223382.0, ans=0.125 2023-06-25 01:10:19,932 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.894e+02 3.467e+02 4.402e+02 7.475e+02, threshold=6.935e+02, percent-clipped=1.0 2023-06-25 01:10:20,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1223442.0, ans=0.125 2023-06-25 01:10:26,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1223442.0, ans=0.125 2023-06-25 01:10:30,287 INFO [train.py:996] (2/4) Epoch 7, batch 20950, loss[loss=0.1724, simple_loss=0.2463, pruned_loss=0.0492, over 21942.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2887, pruned_loss=0.06899, over 4255486.49 frames. ], batch size: 98, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:12:09,740 INFO [train.py:996] (2/4) Epoch 7, batch 21000, loss[loss=0.2247, simple_loss=0.2989, pruned_loss=0.0753, over 21839.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2871, pruned_loss=0.06888, over 4267959.48 frames. ], batch size: 333, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:12:09,741 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 01:12:27,629 INFO [train.py:1028] (2/4) Epoch 7, validation: loss=0.2666, simple_loss=0.3633, pruned_loss=0.08493, over 1796401.00 frames. 
2023-06-25 01:12:27,630 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-25 01:12:53,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1223862.0, ans=0.0 2023-06-25 01:13:57,665 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0 2023-06-25 01:14:06,884 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.137e+02 2.703e+02 3.087e+02 3.976e+02 6.503e+02, threshold=6.174e+02, percent-clipped=0.0 2023-06-25 01:14:14,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1224042.0, ans=0.1 2023-06-25 01:14:17,190 INFO [train.py:996] (2/4) Epoch 7, batch 21050, loss[loss=0.1999, simple_loss=0.2618, pruned_loss=0.06902, over 21498.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2849, pruned_loss=0.06893, over 4267092.35 frames. ], batch size: 230, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:14:33,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1224162.0, ans=0.2 2023-06-25 01:14:54,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1224222.0, ans=0.1 2023-06-25 01:14:56,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1224222.0, ans=0.125 2023-06-25 01:16:05,183 INFO [train.py:996] (2/4) Epoch 7, batch 21100, loss[loss=0.1885, simple_loss=0.2635, pruned_loss=0.05675, over 21726.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2804, pruned_loss=0.06828, over 4255822.39 frames. ], batch size: 371, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:16:39,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1224462.0, ans=0.0 2023-06-25 01:16:39,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1224462.0, ans=0.1 2023-06-25 01:17:02,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1224522.0, ans=0.2 2023-06-25 01:17:24,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1224582.0, ans=0.1 2023-06-25 01:17:42,063 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.657e+02 3.143e+02 4.101e+02 9.163e+02, threshold=6.287e+02, percent-clipped=4.0 2023-06-25 01:17:52,642 INFO [train.py:996] (2/4) Epoch 7, batch 21150, loss[loss=0.2021, simple_loss=0.269, pruned_loss=0.06764, over 21777.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2781, pruned_loss=0.06864, over 4249434.51 frames. ], batch size: 124, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:18:06,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1224702.0, ans=0.125 2023-06-25 01:19:07,430 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.39 vs. limit=15.0 2023-06-25 01:19:39,304 INFO [train.py:996] (2/4) Epoch 7, batch 21200, loss[loss=0.1883, simple_loss=0.2642, pruned_loss=0.05622, over 21637.00 frames. 
], tot_loss[loss=0.2042, simple_loss=0.2734, pruned_loss=0.0675, over 4254500.22 frames. ], batch size: 332, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 01:19:59,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1225062.0, ans=0.1 2023-06-25 01:20:22,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1225062.0, ans=0.125 2023-06-25 01:20:44,547 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:21:16,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1225242.0, ans=0.125 2023-06-25 01:21:17,800 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.659e+02 3.125e+02 3.870e+02 6.186e+02, threshold=6.250e+02, percent-clipped=0.0 2023-06-25 01:21:28,370 INFO [train.py:996] (2/4) Epoch 7, batch 21250, loss[loss=0.2551, simple_loss=0.3401, pruned_loss=0.08505, over 21701.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.272, pruned_loss=0.06726, over 4264747.05 frames. ], batch size: 391, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 01:21:38,006 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.58 vs. limit=12.0 2023-06-25 01:23:10,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1225542.0, ans=0.125 2023-06-25 01:23:15,833 INFO [train.py:996] (2/4) Epoch 7, batch 21300, loss[loss=0.2268, simple_loss=0.2963, pruned_loss=0.07866, over 21926.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2775, pruned_loss=0.06889, over 4258114.56 frames. ], batch size: 333, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 01:23:23,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1225602.0, ans=0.125 2023-06-25 01:23:29,430 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.72 vs. limit=15.0 2023-06-25 01:24:19,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1225722.0, ans=0.125 2023-06-25 01:24:55,313 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.338e+02 2.894e+02 3.300e+02 4.575e+02 9.382e+02, threshold=6.600e+02, percent-clipped=9.0 2023-06-25 01:25:04,017 INFO [train.py:996] (2/4) Epoch 7, batch 21350, loss[loss=0.2177, simple_loss=0.3093, pruned_loss=0.06305, over 21633.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2813, pruned_loss=0.06955, over 4253797.64 frames. ], batch size: 389, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:25:05,097 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.45 vs. limit=15.0 2023-06-25 01:25:15,712 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.05 vs. 
limit=10.0 2023-06-25 01:25:35,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1225962.0, ans=0.125 2023-06-25 01:26:00,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1226022.0, ans=0.125 2023-06-25 01:26:03,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1226022.0, ans=0.125 2023-06-25 01:26:51,939 INFO [train.py:996] (2/4) Epoch 7, batch 21400, loss[loss=0.2181, simple_loss=0.2999, pruned_loss=0.06812, over 21935.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.285, pruned_loss=0.06921, over 4263474.18 frames. ], batch size: 316, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:28:00,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1226382.0, ans=0.1 2023-06-25 01:28:23,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1226442.0, ans=0.125 2023-06-25 01:28:31,712 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 3.088e+02 4.012e+02 5.119e+02 7.296e+02, threshold=8.024e+02, percent-clipped=4.0 2023-06-25 01:28:40,327 INFO [train.py:996] (2/4) Epoch 7, batch 21450, loss[loss=0.2575, simple_loss=0.3284, pruned_loss=0.09328, over 21331.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2894, pruned_loss=0.07096, over 4271766.99 frames. ], batch size: 548, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:29:15,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1226562.0, ans=0.125 2023-06-25 01:30:28,720 INFO [train.py:996] (2/4) Epoch 7, batch 21500, loss[loss=0.1891, simple_loss=0.2525, pruned_loss=0.06288, over 21401.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2897, pruned_loss=0.07144, over 4259662.21 frames. ], batch size: 194, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:31:04,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1226862.0, ans=0.0 2023-06-25 01:31:28,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1226922.0, ans=0.0 2023-06-25 01:31:43,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1226982.0, ans=0.125 2023-06-25 01:31:43,886 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.90 vs. limit=15.0 2023-06-25 01:31:47,029 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-25 01:32:06,011 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.254e+02 2.889e+02 3.383e+02 4.228e+02 8.142e+02, threshold=6.766e+02, percent-clipped=1.0 2023-06-25 01:32:13,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1227102.0, ans=0.5 2023-06-25 01:32:14,649 INFO [train.py:996] (2/4) Epoch 7, batch 21550, loss[loss=0.1897, simple_loss=0.257, pruned_loss=0.06114, over 21697.00 frames. ], tot_loss[loss=0.212, simple_loss=0.285, pruned_loss=0.06955, over 4262883.67 frames. 
], batch size: 112, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:32:24,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1227102.0, ans=0.2 2023-06-25 01:32:30,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1227162.0, ans=0.07 2023-06-25 01:32:33,200 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=22.5 2023-06-25 01:32:45,916 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.02 vs. limit=15.0 2023-06-25 01:33:12,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1227222.0, ans=0.125 2023-06-25 01:33:57,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1227402.0, ans=0.0 2023-06-25 01:33:59,044 INFO [train.py:996] (2/4) Epoch 7, batch 21600, loss[loss=0.1659, simple_loss=0.2451, pruned_loss=0.04335, over 21198.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.28, pruned_loss=0.06834, over 4261481.52 frames. ], batch size: 548, lr: 4.24e-03, grad_scale: 32.0 2023-06-25 01:35:25,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1227582.0, ans=0.1 2023-06-25 01:35:40,101 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.017e+02 2.809e+02 3.415e+02 4.856e+02 1.279e+03, threshold=6.830e+02, percent-clipped=8.0 2023-06-25 01:35:53,451 INFO [train.py:996] (2/4) Epoch 7, batch 21650, loss[loss=0.2132, simple_loss=0.2938, pruned_loss=0.06631, over 21182.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2832, pruned_loss=0.06629, over 4270470.70 frames. ], batch size: 143, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:36:57,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1227822.0, ans=0.125 2023-06-25 01:37:09,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1227882.0, ans=0.2 2023-06-25 01:37:11,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1227882.0, ans=0.1 2023-06-25 01:37:34,939 INFO [train.py:996] (2/4) Epoch 7, batch 21700, loss[loss=0.1917, simple_loss=0.2543, pruned_loss=0.06452, over 21481.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2829, pruned_loss=0.06501, over 4271189.24 frames. 
], batch size: 195, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:38:05,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1228062.0, ans=0.1 2023-06-25 01:38:40,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1228122.0, ans=0.0 2023-06-25 01:38:45,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1228182.0, ans=0.125 2023-06-25 01:38:49,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1228182.0, ans=0.125 2023-06-25 01:39:14,326 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 3.013e+02 3.692e+02 5.814e+02 1.203e+03, threshold=7.384e+02, percent-clipped=13.0 2023-06-25 01:39:20,698 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-25 01:39:20,989 INFO [train.py:996] (2/4) Epoch 7, batch 21750, loss[loss=0.188, simple_loss=0.251, pruned_loss=0.06248, over 21693.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2793, pruned_loss=0.06521, over 4273924.87 frames. ], batch size: 299, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:40:00,882 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.55 vs. limit=12.0 2023-06-25 01:40:26,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1228422.0, ans=0.0 2023-06-25 01:41:08,621 INFO [train.py:996] (2/4) Epoch 7, batch 21800, loss[loss=0.207, simple_loss=0.2844, pruned_loss=0.06487, over 21535.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2782, pruned_loss=0.06645, over 4250576.18 frames. ], batch size: 230, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:41:09,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1228602.0, ans=0.125 2023-06-25 01:41:44,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1228662.0, ans=0.125 2023-06-25 01:42:07,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1228722.0, ans=0.2 2023-06-25 01:42:39,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1228842.0, ans=0.1 2023-06-25 01:42:45,929 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.478e+02 3.194e+02 4.069e+02 5.190e+02 9.750e+02, threshold=8.138e+02, percent-clipped=3.0 2023-06-25 01:42:49,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1228842.0, ans=0.0 2023-06-25 01:42:53,054 INFO [train.py:996] (2/4) Epoch 7, batch 21850, loss[loss=0.2068, simple_loss=0.2796, pruned_loss=0.06694, over 21830.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2849, pruned_loss=0.06682, over 4255907.58 frames. ], batch size: 282, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:43:24,507 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.04 vs. 
limit=15.0 2023-06-25 01:43:57,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1229022.0, ans=0.125 2023-06-25 01:44:00,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1229022.0, ans=0.125 2023-06-25 01:44:19,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1229082.0, ans=0.1 2023-06-25 01:44:30,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1229142.0, ans=0.0 2023-06-25 01:44:44,841 INFO [train.py:996] (2/4) Epoch 7, batch 21900, loss[loss=0.1953, simple_loss=0.2624, pruned_loss=0.06406, over 21706.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2844, pruned_loss=0.06791, over 4260858.34 frames. ], batch size: 264, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:45:04,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1229202.0, ans=0.0 2023-06-25 01:45:42,802 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.41 vs. limit=12.0 2023-06-25 01:45:54,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1229382.0, ans=0.1 2023-06-25 01:46:19,672 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.343e+02 2.991e+02 3.581e+02 4.789e+02 1.002e+03, threshold=7.161e+02, percent-clipped=1.0 2023-06-25 01:46:31,073 INFO [train.py:996] (2/4) Epoch 7, batch 21950, loss[loss=0.1929, simple_loss=0.2554, pruned_loss=0.06515, over 21512.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2796, pruned_loss=0.06721, over 4271349.69 frames. ], batch size: 195, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:46:38,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1229502.0, ans=0.125 2023-06-25 01:47:07,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1229562.0, ans=0.0 2023-06-25 01:47:47,101 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=15.0 2023-06-25 01:48:26,803 INFO [train.py:996] (2/4) Epoch 7, batch 22000, loss[loss=0.1839, simple_loss=0.259, pruned_loss=0.05442, over 21609.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.274, pruned_loss=0.06525, over 4264654.20 frames. ], batch size: 247, lr: 4.24e-03, grad_scale: 32.0 2023-06-25 01:48:37,000 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.36 vs. 
limit=15.0 2023-06-25 01:48:41,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1229802.0, ans=0.0 2023-06-25 01:49:34,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1229982.0, ans=0.125 2023-06-25 01:49:39,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1229982.0, ans=0.125 2023-06-25 01:50:03,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1230042.0, ans=0.035 2023-06-25 01:50:07,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1230042.0, ans=0.0 2023-06-25 01:50:11,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1230042.0, ans=0.2 2023-06-25 01:50:12,195 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 3.193e+02 3.853e+02 5.102e+02 1.201e+03, threshold=7.707e+02, percent-clipped=7.0 2023-06-25 01:50:17,730 INFO [train.py:996] (2/4) Epoch 7, batch 22050, loss[loss=0.2279, simple_loss=0.3127, pruned_loss=0.07152, over 21559.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2788, pruned_loss=0.0661, over 4257499.31 frames. ], batch size: 230, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:50:23,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1230102.0, ans=0.125 2023-06-25 01:50:51,831 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=12.0 2023-06-25 01:50:58,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1230162.0, ans=0.0 2023-06-25 01:51:55,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1230342.0, ans=0.125 2023-06-25 01:52:06,970 INFO [train.py:996] (2/4) Epoch 7, batch 22100, loss[loss=0.2239, simple_loss=0.2899, pruned_loss=0.07895, over 21801.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2895, pruned_loss=0.07111, over 4256141.88 frames. ], batch size: 247, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:52:07,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1230402.0, ans=0.125 2023-06-25 01:52:10,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1230402.0, ans=0.0 2023-06-25 01:52:14,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1230402.0, ans=0.0 2023-06-25 01:53:06,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1230522.0, ans=0.125 2023-06-25 01:53:25,308 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.86 vs. 
limit=15.0 2023-06-25 01:53:49,135 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.658e+02 3.415e+02 4.118e+02 5.475e+02 8.069e+02, threshold=8.235e+02, percent-clipped=4.0 2023-06-25 01:53:54,210 INFO [train.py:996] (2/4) Epoch 7, batch 22150, loss[loss=0.26, simple_loss=0.3092, pruned_loss=0.1054, over 21756.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2925, pruned_loss=0.0725, over 4260180.08 frames. ], batch size: 508, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:53:54,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1230702.0, ans=0.125 2023-06-25 01:55:02,948 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.43 vs. limit=12.0 2023-06-25 01:55:19,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1230942.0, ans=0.0 2023-06-25 01:55:35,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1230942.0, ans=0.0 2023-06-25 01:55:35,905 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.31 vs. limit=10.0 2023-06-25 01:55:41,160 INFO [train.py:996] (2/4) Epoch 7, batch 22200, loss[loss=0.2492, simple_loss=0.3423, pruned_loss=0.07808, over 21869.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2946, pruned_loss=0.07308, over 4270292.14 frames. ], batch size: 371, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:55:42,549 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.68 vs. limit=15.0 2023-06-25 01:56:16,608 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.96 vs. limit=22.5 2023-06-25 01:56:20,101 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.17 vs. limit=15.0 2023-06-25 01:57:01,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1231182.0, ans=0.05 2023-06-25 01:57:25,293 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.355e+02 3.120e+02 3.891e+02 5.411e+02 1.488e+03, threshold=7.782e+02, percent-clipped=8.0 2023-06-25 01:57:31,143 INFO [train.py:996] (2/4) Epoch 7, batch 22250, loss[loss=0.2233, simple_loss=0.2917, pruned_loss=0.07747, over 21278.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3014, pruned_loss=0.07547, over 4275410.78 frames. ], batch size: 176, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:57:45,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1231302.0, ans=0.1 2023-06-25 01:59:13,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1231542.0, ans=0.125 2023-06-25 01:59:18,363 INFO [train.py:996] (2/4) Epoch 7, batch 22300, loss[loss=0.2314, simple_loss=0.2962, pruned_loss=0.08333, over 21928.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3025, pruned_loss=0.07693, over 4278431.11 frames. 
], batch size: 333, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:00:05,910 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-25 02:01:00,004 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.515e+02 3.143e+02 3.997e+02 5.587e+02 8.969e+02, threshold=7.995e+02, percent-clipped=6.0 2023-06-25 02:01:10,876 INFO [train.py:996] (2/4) Epoch 7, batch 22350, loss[loss=0.2408, simple_loss=0.3177, pruned_loss=0.08196, over 20111.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.301, pruned_loss=0.07782, over 4287317.31 frames. ], batch size: 703, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:01:14,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1231902.0, ans=0.0 2023-06-25 02:01:15,501 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=15.0 2023-06-25 02:01:31,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1231902.0, ans=0.1 2023-06-25 02:01:38,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1231962.0, ans=0.125 2023-06-25 02:01:40,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1231962.0, ans=0.2 2023-06-25 02:02:06,711 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.23 vs. limit=12.0 2023-06-25 02:02:24,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1232082.0, ans=0.125 2023-06-25 02:02:59,816 INFO [train.py:996] (2/4) Epoch 7, batch 22400, loss[loss=0.2024, simple_loss=0.2678, pruned_loss=0.06846, over 21310.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2966, pruned_loss=0.07408, over 4288910.76 frames. ], batch size: 177, lr: 4.23e-03, grad_scale: 32.0 2023-06-25 02:03:06,975 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.20 vs. limit=15.0 2023-06-25 02:03:16,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1232202.0, ans=0.125 2023-06-25 02:03:42,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1232322.0, ans=0.125 2023-06-25 02:04:32,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1232442.0, ans=0.125 2023-06-25 02:04:42,726 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 2.737e+02 3.177e+02 4.252e+02 6.969e+02, threshold=6.354e+02, percent-clipped=0.0 2023-06-25 02:04:48,418 INFO [train.py:996] (2/4) Epoch 7, batch 22450, loss[loss=0.2097, simple_loss=0.2772, pruned_loss=0.07113, over 21808.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2904, pruned_loss=0.07335, over 4282077.26 frames. ], batch size: 118, lr: 4.23e-03, grad_scale: 32.0 2023-06-25 02:05:28,715 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.43 vs. 
limit=15.0 2023-06-25 02:06:43,925 INFO [train.py:996] (2/4) Epoch 7, batch 22500, loss[loss=0.2087, simple_loss=0.314, pruned_loss=0.05169, over 20837.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2866, pruned_loss=0.07238, over 4272203.48 frames. ], batch size: 607, lr: 4.23e-03, grad_scale: 32.0 2023-06-25 02:07:39,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1232922.0, ans=0.125 2023-06-25 02:08:02,880 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 02:08:08,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1233042.0, ans=0.125 2023-06-25 02:08:12,391 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 02:08:22,747 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 2.995e+02 3.831e+02 4.510e+02 7.998e+02, threshold=7.663e+02, percent-clipped=9.0 2023-06-25 02:08:32,972 INFO [train.py:996] (2/4) Epoch 7, batch 22550, loss[loss=0.2222, simple_loss=0.2972, pruned_loss=0.07355, over 21301.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2918, pruned_loss=0.07287, over 4281256.29 frames. ], batch size: 176, lr: 4.23e-03, grad_scale: 32.0 2023-06-25 02:08:37,856 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0 2023-06-25 02:09:37,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1233222.0, ans=0.1 2023-06-25 02:10:25,389 INFO [train.py:996] (2/4) Epoch 7, batch 22600, loss[loss=0.2237, simple_loss=0.3085, pruned_loss=0.06938, over 21744.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2928, pruned_loss=0.07282, over 4280386.04 frames. ], batch size: 298, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:12:10,427 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.462e+02 3.222e+02 3.850e+02 5.288e+02 1.031e+03, threshold=7.700e+02, percent-clipped=4.0 2023-06-25 02:12:14,434 INFO [train.py:996] (2/4) Epoch 7, batch 22650, loss[loss=0.201, simple_loss=0.2794, pruned_loss=0.0613, over 21641.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2916, pruned_loss=0.07345, over 4277131.59 frames. ], batch size: 263, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:12:38,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1233702.0, ans=0.1 2023-06-25 02:13:49,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1233942.0, ans=0.0 2023-06-25 02:14:01,956 INFO [train.py:996] (2/4) Epoch 7, batch 22700, loss[loss=0.2032, simple_loss=0.2758, pruned_loss=0.06537, over 21810.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2857, pruned_loss=0.07249, over 4269966.18 frames. 
], batch size: 317, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:15:06,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1234122.0, ans=0.0 2023-06-25 02:15:20,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1234182.0, ans=0.125 2023-06-25 02:15:46,862 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.360e+02 3.300e+02 4.052e+02 5.642e+02 1.079e+03, threshold=8.104e+02, percent-clipped=7.0 2023-06-25 02:15:49,905 INFO [train.py:996] (2/4) Epoch 7, batch 22750, loss[loss=0.2939, simple_loss=0.355, pruned_loss=0.1164, over 21824.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2893, pruned_loss=0.07433, over 4270505.25 frames. ], batch size: 124, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:16:27,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1234362.0, ans=0.1 2023-06-25 02:16:33,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1234422.0, ans=0.0 2023-06-25 02:16:57,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1234482.0, ans=0.1 2023-06-25 02:17:36,767 INFO [train.py:996] (2/4) Epoch 7, batch 22800, loss[loss=0.2102, simple_loss=0.2912, pruned_loss=0.0646, over 21840.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.293, pruned_loss=0.07623, over 4269337.69 frames. ], batch size: 124, lr: 4.23e-03, grad_scale: 32.0 2023-06-25 02:17:59,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1234602.0, ans=0.2 2023-06-25 02:19:03,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1234782.0, ans=0.125 2023-06-25 02:19:08,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1234842.0, ans=0.125 2023-06-25 02:19:23,106 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.424e+02 3.142e+02 3.789e+02 4.718e+02 7.259e+02, threshold=7.578e+02, percent-clipped=0.0 2023-06-25 02:19:25,139 INFO [train.py:996] (2/4) Epoch 7, batch 22850, loss[loss=0.1729, simple_loss=0.2501, pruned_loss=0.04784, over 19862.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2892, pruned_loss=0.07552, over 4270239.64 frames. ], batch size: 703, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:20:15,214 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.06 vs. limit=15.0 2023-06-25 02:20:25,455 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.10 vs. 
limit=10.0 2023-06-25 02:20:50,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1235142.0, ans=0.125 2023-06-25 02:20:52,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1235142.0, ans=0.125 2023-06-25 02:21:04,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1235142.0, ans=0.125 2023-06-25 02:21:09,530 INFO [train.py:996] (2/4) Epoch 7, batch 22900, loss[loss=0.2112, simple_loss=0.3063, pruned_loss=0.05804, over 21736.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2889, pruned_loss=0.07463, over 4263885.73 frames. ], batch size: 247, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:21:11,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1235202.0, ans=0.125 2023-06-25 02:21:29,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1235202.0, ans=0.125 2023-06-25 02:21:32,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1235262.0, ans=0.0 2023-06-25 02:21:38,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1235262.0, ans=0.125 2023-06-25 02:22:27,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1235382.0, ans=0.1 2023-06-25 02:23:04,026 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.336e+02 3.455e+02 4.744e+02 6.371e+02 1.430e+03, threshold=9.487e+02, percent-clipped=13.0 2023-06-25 02:23:05,581 INFO [train.py:996] (2/4) Epoch 7, batch 22950, loss[loss=0.2624, simple_loss=0.3798, pruned_loss=0.07251, over 21660.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3005, pruned_loss=0.07344, over 4264591.44 frames. ], batch size: 414, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:23:28,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1235562.0, ans=0.125 2023-06-25 02:24:03,932 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-06-25 02:24:53,124 INFO [train.py:996] (2/4) Epoch 7, batch 23000, loss[loss=0.2178, simple_loss=0.2986, pruned_loss=0.06846, over 21672.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2995, pruned_loss=0.07145, over 4266377.48 frames. ], batch size: 389, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:25:02,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1235802.0, ans=0.125 2023-06-25 02:26:40,406 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.181e+02 3.049e+02 3.858e+02 4.759e+02 9.781e+02, threshold=7.716e+02, percent-clipped=2.0 2023-06-25 02:26:42,798 INFO [train.py:996] (2/4) Epoch 7, batch 23050, loss[loss=0.2581, simple_loss=0.3293, pruned_loss=0.0935, over 21829.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3007, pruned_loss=0.07271, over 4266549.69 frames. 
], batch size: 247, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:27:39,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1236222.0, ans=0.2 2023-06-25 02:27:43,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1236222.0, ans=0.0 2023-06-25 02:28:31,620 INFO [train.py:996] (2/4) Epoch 7, batch 23100, loss[loss=0.2793, simple_loss=0.3354, pruned_loss=0.1116, over 21401.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2975, pruned_loss=0.07296, over 4265939.22 frames. ], batch size: 471, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:29:24,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1236522.0, ans=0.125 2023-06-25 02:29:38,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1236582.0, ans=0.1 2023-06-25 02:30:16,717 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.225e+02 3.056e+02 3.591e+02 4.604e+02 9.748e+02, threshold=7.182e+02, percent-clipped=1.0 2023-06-25 02:30:18,316 INFO [train.py:996] (2/4) Epoch 7, batch 23150, loss[loss=0.1966, simple_loss=0.2653, pruned_loss=0.06399, over 21842.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2936, pruned_loss=0.07282, over 4275488.45 frames. ], batch size: 298, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:31:10,911 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.47 vs. limit=22.5 2023-06-25 02:31:24,292 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.40 vs. limit=10.0 2023-06-25 02:32:03,902 INFO [train.py:996] (2/4) Epoch 7, batch 23200, loss[loss=0.2344, simple_loss=0.2912, pruned_loss=0.08879, over 21614.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2935, pruned_loss=0.0739, over 4282969.80 frames. ], batch size: 471, lr: 4.23e-03, grad_scale: 32.0 2023-06-25 02:32:13,839 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.64 vs. 
limit=10.0 2023-06-25 02:32:20,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1237062.0, ans=0.0 2023-06-25 02:32:23,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1237062.0, ans=0.0 2023-06-25 02:32:27,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1237062.0, ans=0.125 2023-06-25 02:32:38,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1237062.0, ans=0.0 2023-06-25 02:32:55,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1237122.0, ans=0.1 2023-06-25 02:33:49,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1237242.0, ans=0.125 2023-06-25 02:33:52,461 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.409e+02 3.126e+02 3.728e+02 5.060e+02 1.069e+03, threshold=7.456e+02, percent-clipped=4.0 2023-06-25 02:33:52,499 INFO [train.py:996] (2/4) Epoch 7, batch 23250, loss[loss=0.2153, simple_loss=0.2807, pruned_loss=0.075, over 21809.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.292, pruned_loss=0.07433, over 4286980.63 frames. ], batch size: 247, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:34:31,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1237362.0, ans=0.2 2023-06-25 02:34:35,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1237362.0, ans=0.1 2023-06-25 02:34:50,247 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.92 vs. limit=15.0 2023-06-25 02:35:43,502 INFO [train.py:996] (2/4) Epoch 7, batch 23300, loss[loss=0.2136, simple_loss=0.3015, pruned_loss=0.06287, over 21854.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3001, pruned_loss=0.07629, over 4291643.36 frames. ], batch size: 124, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:36:16,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1237662.0, ans=0.1 2023-06-25 02:36:30,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1237722.0, ans=0.0 2023-06-25 02:36:39,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1237722.0, ans=0.0 2023-06-25 02:36:39,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1237722.0, ans=0.2 2023-06-25 02:37:06,280 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.59 vs. 
limit=15.0 2023-06-25 02:37:19,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1237842.0, ans=0.0 2023-06-25 02:37:23,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1237842.0, ans=0.0 2023-06-25 02:37:39,011 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.304e+02 3.209e+02 3.833e+02 5.523e+02 1.342e+03, threshold=7.666e+02, percent-clipped=15.0 2023-06-25 02:37:39,051 INFO [train.py:996] (2/4) Epoch 7, batch 23350, loss[loss=0.1602, simple_loss=0.2394, pruned_loss=0.0405, over 21241.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3037, pruned_loss=0.07554, over 4292510.51 frames. ], batch size: 159, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:38:12,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1237962.0, ans=0.125 2023-06-25 02:38:45,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1238082.0, ans=0.125 2023-06-25 02:39:33,723 INFO [train.py:996] (2/4) Epoch 7, batch 23400, loss[loss=0.2181, simple_loss=0.2979, pruned_loss=0.06921, over 21933.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2976, pruned_loss=0.07269, over 4287178.77 frames. ], batch size: 107, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:39:55,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1238262.0, ans=0.0 2023-06-25 02:40:55,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1238382.0, ans=0.0 2023-06-25 02:41:23,137 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.927e+02 3.153e+02 4.336e+02 5.410e+02 1.099e+03, threshold=8.672e+02, percent-clipped=12.0 2023-06-25 02:41:23,169 INFO [train.py:996] (2/4) Epoch 7, batch 23450, loss[loss=0.2802, simple_loss=0.3561, pruned_loss=0.1021, over 21860.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2999, pruned_loss=0.07394, over 4282248.51 frames. ], batch size: 124, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:41:32,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1238502.0, ans=0.0 2023-06-25 02:41:44,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1238562.0, ans=0.125 2023-06-25 02:41:47,101 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-06-25 02:42:20,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1238622.0, ans=0.125 2023-06-25 02:42:27,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1238682.0, ans=0.0 2023-06-25 02:42:57,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1238742.0, ans=0.09899494936611666 2023-06-25 02:43:06,182 INFO [train.py:996] (2/4) Epoch 7, batch 23500, loss[loss=0.1955, simple_loss=0.2653, pruned_loss=0.06279, over 21549.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.301, pruned_loss=0.07562, over 4284013.23 frames. 
], batch size: 195, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:43:38,197 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.64 vs. limit=15.0 2023-06-25 02:44:53,876 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.391e+02 2.970e+02 3.465e+02 4.227e+02 7.885e+02, threshold=6.930e+02, percent-clipped=0.0 2023-06-25 02:44:53,911 INFO [train.py:996] (2/4) Epoch 7, batch 23550, loss[loss=0.2059, simple_loss=0.2654, pruned_loss=0.07313, over 21478.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2954, pruned_loss=0.07575, over 4274017.42 frames. ], batch size: 548, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:45:04,197 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.36 vs. limit=22.5 2023-06-25 02:45:33,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1239222.0, ans=0.0 2023-06-25 02:46:02,139 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=22.5 2023-06-25 02:46:14,175 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-06-25 02:46:42,571 INFO [train.py:996] (2/4) Epoch 7, batch 23600, loss[loss=0.1881, simple_loss=0.2371, pruned_loss=0.06955, over 20881.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2952, pruned_loss=0.076, over 4272853.35 frames. ], batch size: 608, lr: 4.22e-03, grad_scale: 32.0 2023-06-25 02:47:08,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1239462.0, ans=0.1 2023-06-25 02:47:08,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1239462.0, ans=0.1 2023-06-25 02:47:12,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1239462.0, ans=0.0 2023-06-25 02:47:35,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1239522.0, ans=0.1 2023-06-25 02:47:52,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1239582.0, ans=0.0 2023-06-25 02:47:58,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1239582.0, ans=0.125 2023-06-25 02:48:28,076 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.331e+02 3.161e+02 4.117e+02 5.105e+02 1.053e+03, threshold=8.234e+02, percent-clipped=8.0 2023-06-25 02:48:28,122 INFO [train.py:996] (2/4) Epoch 7, batch 23650, loss[loss=0.2306, simple_loss=0.3128, pruned_loss=0.07414, over 21914.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2959, pruned_loss=0.07439, over 4267483.21 frames. ], batch size: 371, lr: 4.22e-03, grad_scale: 32.0 2023-06-25 02:48:32,510 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.89 vs. 
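limit=22.5

The [optim.py:471] entries above report Clipping_scale, the quartiles of recent gradient norms, the clipping threshold derived from them, and the running fraction of clipped batches. The snippet below is only a rough sketch of how statistics of that shape could be produced (a window of recent gradient norms, a threshold set to clipping_scale times the median of that window); the class and parameter names are illustrative and this is not the icefall optimizer itself.

```python
# Rough sketch only (not the icefall implementation): keep a window of recent
# gradient norms, report their quartiles, and clip any gradient whose norm
# exceeds `clipping_scale` times the median of that window.  The printed line
# mirrors the "grad-norm quartiles ... threshold ... percent-clipped" fields
# of the [optim.py:471] records; all names here are illustrative.
from collections import deque

import torch


class QuartileGradClipper:
    def __init__(self, clipping_scale: float = 2.0, history: int = 128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=history)  # recent total grad norms
        self.num_clipped = 0
        self.num_seen = 0

    def __call__(self, parameters) -> None:
        params = [p for p in parameters if p.grad is not None]
        total_norm = torch.norm(
            torch.stack([p.grad.detach().norm(2) for p in params])
        ).item()
        self.norms.append(total_norm)
        self.num_seen += 1

        history = sorted(self.norms)
        # min / 25% / 50% / 75% / max of the recent grad-norm history.
        quartiles = [
            history[int(q * (len(history) - 1))] for q in (0, 0.25, 0.5, 0.75, 1.0)
        ]
        threshold = self.clipping_scale * quartiles[2]  # scale * median

        if threshold > 0 and total_norm > threshold:
            self.num_clipped += 1
            for p in params:
                p.grad.detach().mul_(threshold / total_norm)

        pct = 100.0 * self.num_clipped / self.num_seen
        print(
            f"Clipping_scale={self.clipping_scale}, grad-norm quartiles "
            + " ".join(f"{v:.3e}" for v in quartiles)
            + f", threshold={threshold:.3e}, percent-clipped={pct:.1f}"
        )
```

In a training loop such a clipper would sit between loss.backward() and optimizer.step().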
2023-06-25 02:48:39,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1239702.0, ans=0.125 2023-06-25 02:49:17,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1239822.0, ans=0.0 2023-06-25 02:49:58,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1239942.0, ans=0.0 2023-06-25 02:50:17,009 INFO [train.py:996] (2/4) Epoch 7, batch 23700, loss[loss=0.2228, simple_loss=0.3066, pruned_loss=0.06947, over 21578.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2981, pruned_loss=0.07435, over 4267467.56 frames. ], batch size: 441, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:51:31,423 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-25 02:51:35,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1240182.0, ans=0.125 2023-06-25 02:51:41,392 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.68 vs. limit=22.5 2023-06-25 02:51:49,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1240242.0, ans=0.125 2023-06-25 02:52:10,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1240242.0, ans=0.0 2023-06-25 02:52:12,620 INFO [train.py:996] (2/4) Epoch 7, batch 23750, loss[loss=0.2227, simple_loss=0.2959, pruned_loss=0.07472, over 21743.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3003, pruned_loss=0.07523, over 4269126.85 frames. ], batch size: 247, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:52:14,424 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.255e+02 3.374e+02 3.894e+02 5.027e+02 8.477e+02, threshold=7.788e+02, percent-clipped=1.0 2023-06-25 02:52:42,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1240362.0, ans=0.0 2023-06-25 02:53:30,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1240482.0, ans=0.0 2023-06-25 02:53:34,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1240542.0, ans=0.0 2023-06-25 02:53:46,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1240542.0, ans=0.95 2023-06-25 02:54:03,190 INFO [train.py:996] (2/4) Epoch 7, batch 23800, loss[loss=0.2857, simple_loss=0.3798, pruned_loss=0.09581, over 21589.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2998, pruned_loss=0.0731, over 4269789.75 frames.
], batch size: 414, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:54:20,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1240602.0, ans=0.0 2023-06-25 02:54:25,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1240602.0, ans=0.0 2023-06-25 02:55:05,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1240722.0, ans=0.07 2023-06-25 02:55:40,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1240842.0, ans=0.125 2023-06-25 02:56:06,045 INFO [train.py:996] (2/4) Epoch 7, batch 23850, loss[loss=0.3366, simple_loss=0.3986, pruned_loss=0.1373, over 21390.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3076, pruned_loss=0.07492, over 4271461.71 frames. ], batch size: 507, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:56:07,971 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 3.127e+02 4.092e+02 4.859e+02 9.689e+02, threshold=8.184e+02, percent-clipped=5.0 2023-06-25 02:56:21,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1240902.0, ans=0.125 2023-06-25 02:56:41,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1241022.0, ans=0.125 2023-06-25 02:57:04,129 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-25 02:57:48,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1241142.0, ans=0.125 2023-06-25 02:57:49,438 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.80 vs. limit=6.0 2023-06-25 02:57:55,510 INFO [train.py:996] (2/4) Epoch 7, batch 23900, loss[loss=0.2268, simple_loss=0.3234, pruned_loss=0.06506, over 21717.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3147, pruned_loss=0.0772, over 4270107.22 frames. ], batch size: 332, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:58:37,285 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 02:58:49,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1241322.0, ans=0.125 2023-06-25 02:59:09,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1241382.0, ans=0.1 2023-06-25 02:59:19,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1241442.0, ans=0.125 2023-06-25 02:59:21,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1241442.0, ans=0.0 2023-06-25 02:59:38,329 INFO [train.py:996] (2/4) Epoch 7, batch 23950, loss[loss=0.2347, simple_loss=0.3197, pruned_loss=0.07487, over 21451.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3086, pruned_loss=0.07635, over 4268146.51 frames. 
], batch size: 131, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:59:39,938 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.625e+02 3.372e+02 4.562e+02 5.557e+02 1.074e+03, threshold=9.124e+02, percent-clipped=7.0 2023-06-25 03:00:07,787 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=15.0 2023-06-25 03:00:41,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1241682.0, ans=0.025 2023-06-25 03:00:52,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1241682.0, ans=0.0 2023-06-25 03:01:27,367 INFO [train.py:996] (2/4) Epoch 7, batch 24000, loss[loss=0.3277, simple_loss=0.3752, pruned_loss=0.1401, over 21467.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3091, pruned_loss=0.07887, over 4270021.90 frames. ], batch size: 510, lr: 4.22e-03, grad_scale: 32.0 2023-06-25 03:01:27,368 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 03:01:45,557 INFO [train.py:1028] (2/4) Epoch 7, validation: loss=0.2668, simple_loss=0.3629, pruned_loss=0.0854, over 1796401.00 frames. 2023-06-25 03:01:45,558 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-25 03:01:46,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1241802.0, ans=0.0 2023-06-25 03:01:58,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1241802.0, ans=0.125 2023-06-25 03:02:31,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1241922.0, ans=0.0 2023-06-25 03:02:36,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1241922.0, ans=0.0 2023-06-25 03:03:06,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1241982.0, ans=15.0 2023-06-25 03:03:27,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1242042.0, ans=0.05 2023-06-25 03:03:35,965 INFO [train.py:996] (2/4) Epoch 7, batch 24050, loss[loss=0.1809, simple_loss=0.2727, pruned_loss=0.04453, over 21758.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3099, pruned_loss=0.07884, over 4273511.27 frames. 
], batch size: 247, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 03:03:39,410 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.322e+02 3.516e+02 4.440e+02 5.748e+02 1.093e+03, threshold=8.881e+02, percent-clipped=2.0 2023-06-25 03:03:46,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1242102.0, ans=0.125 2023-06-25 03:03:48,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1242102.0, ans=0.125 2023-06-25 03:03:50,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1242102.0, ans=0.1 2023-06-25 03:04:23,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1242222.0, ans=0.125 2023-06-25 03:04:30,468 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.90 vs. limit=22.5 2023-06-25 03:04:45,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1242282.0, ans=0.5 2023-06-25 03:05:19,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1242402.0, ans=0.0 2023-06-25 03:05:20,279 INFO [train.py:996] (2/4) Epoch 7, batch 24100, loss[loss=0.2543, simple_loss=0.337, pruned_loss=0.08582, over 21693.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3094, pruned_loss=0.07739, over 4273403.33 frames. ], batch size: 351, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 03:05:21,624 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-25 03:06:29,168 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.63 vs. limit=12.0 2023-06-25 03:06:37,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1242582.0, ans=0.125 2023-06-25 03:06:50,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=1242642.0, ans=0.2 2023-06-25 03:07:09,427 INFO [train.py:996] (2/4) Epoch 7, batch 24150, loss[loss=0.2399, simple_loss=0.3066, pruned_loss=0.08662, over 21921.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3077, pruned_loss=0.07848, over 4280957.43 frames. ], batch size: 333, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 03:07:12,837 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.342e+02 3.235e+02 4.030e+02 4.867e+02 1.048e+03, threshold=8.060e+02, percent-clipped=3.0 2023-06-25 03:07:17,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1242702.0, ans=0.2 2023-06-25 03:08:21,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1242882.0, ans=0.125 2023-06-25 03:08:38,133 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=22.5 2023-06-25 03:08:53,071 INFO [train.py:996] (2/4) Epoch 7, batch 24200, loss[loss=0.1964, simple_loss=0.2762, pruned_loss=0.05835, over 21420.00 frames. 
], tot_loss[loss=0.2346, simple_loss=0.3099, pruned_loss=0.07962, over 4284181.43 frames. ], batch size: 131, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 03:10:20,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1243242.0, ans=0.0 2023-06-25 03:10:26,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1243242.0, ans=0.0 2023-06-25 03:10:28,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1243242.0, ans=0.2 2023-06-25 03:10:29,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1243242.0, ans=0.125 2023-06-25 03:10:48,486 INFO [train.py:996] (2/4) Epoch 7, batch 24250, loss[loss=0.1944, simple_loss=0.2953, pruned_loss=0.04672, over 21683.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3071, pruned_loss=0.07366, over 4286106.11 frames. ], batch size: 414, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:10:49,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1243302.0, ans=0.0 2023-06-25 03:10:51,921 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 3.061e+02 3.870e+02 4.839e+02 8.744e+02, threshold=7.741e+02, percent-clipped=3.0 2023-06-25 03:11:33,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1243362.0, ans=10.0 2023-06-25 03:12:38,086 INFO [train.py:996] (2/4) Epoch 7, batch 24300, loss[loss=0.1427, simple_loss=0.2148, pruned_loss=0.03529, over 21230.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2999, pruned_loss=0.06814, over 4282458.46 frames. ], batch size: 176, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:12:38,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1243602.0, ans=0.125 2023-06-25 03:12:52,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1243602.0, ans=0.035 2023-06-25 03:13:08,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1243662.0, ans=0.2 2023-06-25 03:13:15,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1243662.0, ans=0.0 2023-06-25 03:13:18,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1243662.0, ans=0.2 2023-06-25 03:13:18,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1243662.0, ans=0.0 2023-06-25 03:13:25,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1243722.0, ans=0.125 2023-06-25 03:14:08,314 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.48 vs. 
limit=15.0 2023-06-25 03:14:19,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1243842.0, ans=0.2 2023-06-25 03:14:26,072 INFO [train.py:996] (2/4) Epoch 7, batch 24350, loss[loss=0.2186, simple_loss=0.2953, pruned_loss=0.0709, over 21897.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2957, pruned_loss=0.06787, over 4286022.95 frames. ], batch size: 316, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:14:34,783 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.057e+02 2.804e+02 3.474e+02 4.596e+02 8.821e+02, threshold=6.948e+02, percent-clipped=1.0 2023-06-25 03:14:46,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1243902.0, ans=0.0 2023-06-25 03:15:21,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1244022.0, ans=0.125 2023-06-25 03:15:39,728 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.69 vs. limit=15.0 2023-06-25 03:16:20,443 INFO [train.py:996] (2/4) Epoch 7, batch 24400, loss[loss=0.2195, simple_loss=0.29, pruned_loss=0.07451, over 20665.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3022, pruned_loss=0.07213, over 4284725.38 frames. ], batch size: 607, lr: 4.21e-03, grad_scale: 32.0 2023-06-25 03:16:29,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1244202.0, ans=0.125 2023-06-25 03:18:15,748 INFO [train.py:996] (2/4) Epoch 7, batch 24450, loss[loss=0.1961, simple_loss=0.2778, pruned_loss=0.05723, over 21286.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3046, pruned_loss=0.07382, over 4282959.52 frames. ], batch size: 159, lr: 4.21e-03, grad_scale: 32.0 2023-06-25 03:18:19,270 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.592e+02 3.443e+02 3.965e+02 5.571e+02 1.139e+03, threshold=7.931e+02, percent-clipped=16.0 2023-06-25 03:20:03,676 INFO [train.py:996] (2/4) Epoch 7, batch 24500, loss[loss=0.214, simple_loss=0.2934, pruned_loss=0.06726, over 21917.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3055, pruned_loss=0.07396, over 4283616.24 frames. ], batch size: 316, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:20:22,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1244862.0, ans=0.0 2023-06-25 03:20:51,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1244922.0, ans=0.125 2023-06-25 03:21:23,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1244982.0, ans=6.0 2023-06-25 03:21:48,795 INFO [train.py:996] (2/4) Epoch 7, batch 24550, loss[loss=0.2381, simple_loss=0.3175, pruned_loss=0.07928, over 21589.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3067, pruned_loss=0.0758, over 4287953.62 frames. 
], batch size: 263, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:21:53,850 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.291e+02 2.970e+02 3.569e+02 4.682e+02 1.145e+03, threshold=7.139e+02, percent-clipped=2.0 2023-06-25 03:21:58,934 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0 2023-06-25 03:23:08,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1245282.0, ans=0.1 2023-06-25 03:23:13,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1245342.0, ans=0.0 2023-06-25 03:23:22,707 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.42 vs. limit=10.0 2023-06-25 03:23:24,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1245342.0, ans=0.05 2023-06-25 03:23:31,389 INFO [train.py:996] (2/4) Epoch 7, batch 24600, loss[loss=0.2845, simple_loss=0.3287, pruned_loss=0.1201, over 21359.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3057, pruned_loss=0.07761, over 4291817.39 frames. ], batch size: 507, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:23:54,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1245462.0, ans=0.125 2023-06-25 03:24:02,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1245462.0, ans=0.1 2023-06-25 03:24:14,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1245522.0, ans=0.125 2023-06-25 03:24:39,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1245582.0, ans=0.2 2023-06-25 03:24:52,099 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-25 03:25:04,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1245642.0, ans=0.125 2023-06-25 03:25:04,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1245642.0, ans=0.1 2023-06-25 03:25:09,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1245642.0, ans=0.125 2023-06-25 03:25:11,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1245642.0, ans=0.125 2023-06-25 03:25:14,544 INFO [train.py:996] (2/4) Epoch 7, batch 24650, loss[loss=0.2037, simple_loss=0.2808, pruned_loss=0.0633, over 21245.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2993, pruned_loss=0.07601, over 4278784.25 frames. 
], batch size: 159, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:25:15,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1245702.0, ans=0.125 2023-06-25 03:25:19,707 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.334e+02 3.258e+02 3.830e+02 5.672e+02 1.406e+03, threshold=7.660e+02, percent-clipped=13.0 2023-06-25 03:25:34,927 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=22.5 2023-06-25 03:26:18,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1245822.0, ans=0.1 2023-06-25 03:27:02,299 INFO [train.py:996] (2/4) Epoch 7, batch 24700, loss[loss=0.2088, simple_loss=0.2777, pruned_loss=0.06996, over 21737.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2961, pruned_loss=0.07375, over 4285233.29 frames. ], batch size: 112, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:27:02,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1246002.0, ans=0.025 2023-06-25 03:27:06,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1246002.0, ans=0.2 2023-06-25 03:27:28,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1246062.0, ans=0.025 2023-06-25 03:28:09,105 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.48 vs. limit=15.0 2023-06-25 03:28:31,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1246242.0, ans=0.0 2023-06-25 03:28:49,811 INFO [train.py:996] (2/4) Epoch 7, batch 24750, loss[loss=0.2061, simple_loss=0.2648, pruned_loss=0.07371, over 21329.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2885, pruned_loss=0.07103, over 4262152.77 frames. ], batch size: 160, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:28:54,679 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.316e+02 2.901e+02 3.279e+02 4.785e+02 1.213e+03, threshold=6.557e+02, percent-clipped=5.0 2023-06-25 03:29:43,749 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-25 03:30:14,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1246482.0, ans=0.1 2023-06-25 03:30:15,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1246482.0, ans=0.2 2023-06-25 03:30:25,220 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.44 vs. limit=10.0 2023-06-25 03:30:35,892 INFO [train.py:996] (2/4) Epoch 7, batch 24800, loss[loss=0.2054, simple_loss=0.2774, pruned_loss=0.06671, over 21842.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2837, pruned_loss=0.07056, over 4269915.33 frames. 
], batch size: 333, lr: 4.21e-03, grad_scale: 32.0 2023-06-25 03:30:54,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1246662.0, ans=0.0 2023-06-25 03:31:04,084 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.23 vs. limit=15.0 2023-06-25 03:31:49,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1246782.0, ans=0.04949747468305833 2023-06-25 03:32:06,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1246842.0, ans=0.125 2023-06-25 03:32:23,879 INFO [train.py:996] (2/4) Epoch 7, batch 24850, loss[loss=0.1865, simple_loss=0.2486, pruned_loss=0.06225, over 21828.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2841, pruned_loss=0.07219, over 4274617.31 frames. ], batch size: 118, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:32:30,870 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.467e+02 3.124e+02 3.906e+02 4.909e+02 9.613e+02, threshold=7.812e+02, percent-clipped=9.0 2023-06-25 03:33:28,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1247022.0, ans=0.09899494936611666 2023-06-25 03:33:32,234 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=15.0 2023-06-25 03:33:54,255 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.68 vs. limit=6.0 2023-06-25 03:34:14,049 INFO [train.py:996] (2/4) Epoch 7, batch 24900, loss[loss=0.2318, simple_loss=0.3008, pruned_loss=0.08139, over 21583.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2886, pruned_loss=0.07355, over 4277762.16 frames. ], batch size: 230, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:34:26,978 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.13 vs. limit=10.0 2023-06-25 03:34:56,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1247262.0, ans=0.125 2023-06-25 03:35:17,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1247322.0, ans=0.2 2023-06-25 03:35:43,392 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:35:47,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1247442.0, ans=0.125 2023-06-25 03:36:08,381 INFO [train.py:996] (2/4) Epoch 7, batch 24950, loss[loss=0.23, simple_loss=0.3029, pruned_loss=0.07853, over 21819.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2957, pruned_loss=0.07708, over 4277173.47 frames. 
], batch size: 282, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:36:15,228 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.748e+02 3.765e+02 4.804e+02 6.774e+02 1.687e+03, threshold=9.608e+02, percent-clipped=17.0 2023-06-25 03:36:49,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1247562.0, ans=0.125 2023-06-25 03:36:52,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1247622.0, ans=0.1 2023-06-25 03:37:09,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1247622.0, ans=0.125 2023-06-25 03:37:11,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1247622.0, ans=0.1 2023-06-25 03:37:21,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1247682.0, ans=0.1 2023-06-25 03:37:24,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1247682.0, ans=0.0 2023-06-25 03:37:57,854 INFO [train.py:996] (2/4) Epoch 7, batch 25000, loss[loss=0.1998, simple_loss=0.2628, pruned_loss=0.06838, over 21381.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3004, pruned_loss=0.07833, over 4279975.55 frames. ], batch size: 211, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:37:58,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1247802.0, ans=0.2 2023-06-25 03:38:21,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1247802.0, ans=0.0 2023-06-25 03:38:26,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1247862.0, ans=0.09899494936611666 2023-06-25 03:38:33,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1247862.0, ans=10.0 2023-06-25 03:39:06,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1247982.0, ans=0.125 2023-06-25 03:39:34,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1248042.0, ans=0.0 2023-06-25 03:39:47,813 INFO [train.py:996] (2/4) Epoch 7, batch 25050, loss[loss=0.1952, simple_loss=0.2581, pruned_loss=0.06615, over 21521.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2942, pruned_loss=0.07685, over 4270310.85 frames. 
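], batch size: 195, lr: 4.21e-03, grad_scale: 16.0

The [scaling.py:182] entries record the current value (ans=...) of various regularization hyperparameters, such as skip rates and balancer probabilities, at a given batch_count. The sketch below shows one way such a value can be made a piecewise-linear function of the global batch count; the class name and the breakpoints are made up for illustration and are not taken from icefall.

```python
# Illustrative sketch (not the icefall class itself): a float hyperparameter
# defined by piecewise-linear interpolation over the global batch count, which
# is one way values such as conv_skip_rate can end up at ans=0.0 / 0.125 for a
# given batch_count in the ScheduledFloat log lines above.
class ScheduledFloatSketch:
    def __init__(self, *points):
        # points: (batch_count, value) pairs, e.g. (0.0, 0.3), (20000.0, 0.0);
        # assumed non-empty and given in any order.
        self.points = sorted(points)

    def value_at(self, batch_count: float) -> float:
        if batch_count <= self.points[0][0]:
            return self.points[0][1]
        if batch_count >= self.points[-1][0]:
            return self.points[-1][1]
        for (x0, y0), (x1, y1) in zip(self.points, self.points[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)
        return self.points[-1][1]  # unreachable; keeps the return type total


# Example (made-up schedule): a skip rate that decays from 0.3 to 0.0 over the
# first 20k batches and then stays at 0.0, hence ans=0.0 at batch counts in the
# millions as seen above.
skip_rate = ScheduledFloatSketch((0.0, 0.3), (20000.0, 0.0))
print(skip_rate.value_at(1248222.0))  # -> 0.0
```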
2023-06-25 03:39:59,669 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.536e+02 3.278e+02 3.984e+02 5.261e+02 1.222e+03, threshold=7.967e+02, percent-clipped=1.0 2023-06-25 03:40:40,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1248222.0, ans=0.125 2023-06-25 03:40:49,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1248222.0, ans=0.05 2023-06-25 03:41:06,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1248282.0, ans=0.125 2023-06-25 03:41:07,287 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=15.0 2023-06-25 03:41:35,616 INFO [train.py:996] (2/4) Epoch 7, batch 25100, loss[loss=0.1851, simple_loss=0.2499, pruned_loss=0.06011, over 21257.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2884, pruned_loss=0.07464, over 4267404.95 frames. ], batch size: 176, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:42:38,097 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.45 vs. limit=15.0 2023-06-25 03:42:49,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1248582.0, ans=0.125 2023-06-25 03:43:15,192 INFO [train.py:996] (2/4) Epoch 7, batch 25150, loss[loss=0.2102, simple_loss=0.2985, pruned_loss=0.06094, over 21439.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2923, pruned_loss=0.07373, over 4254205.32 frames. ], batch size: 131, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:43:17,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1248702.0, ans=0.125 2023-06-25 03:43:22,418 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 2.917e+02 3.507e+02 4.290e+02 7.134e+02, threshold=7.014e+02, percent-clipped=0.0 2023-06-25 03:43:56,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1248762.0, ans=0.0 2023-06-25 03:44:02,897 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. limit=6.0 2023-06-25 03:44:48,884 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=22.5 2023-06-25 03:44:48,904 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=15.0 2023-06-25 03:45:03,188 INFO [train.py:996] (2/4) Epoch 7, batch 25200, loss[loss=0.2002, simple_loss=0.2821, pruned_loss=0.05912, over 21698.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2916, pruned_loss=0.07099, over 4249874.94 frames.
], batch size: 298, lr: 4.21e-03, grad_scale: 32.0 2023-06-25 03:45:51,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1249122.0, ans=0.0 2023-06-25 03:46:17,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1249182.0, ans=0.125 2023-06-25 03:46:44,540 INFO [train.py:996] (2/4) Epoch 7, batch 25250, loss[loss=0.1849, simple_loss=0.2589, pruned_loss=0.05547, over 21142.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.289, pruned_loss=0.06958, over 4252066.11 frames. ], batch size: 548, lr: 4.20e-03, grad_scale: 32.0 2023-06-25 03:46:45,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1249302.0, ans=0.0 2023-06-25 03:46:50,732 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.310e+02 3.493e+02 4.531e+02 6.299e+02 1.264e+03, threshold=9.062e+02, percent-clipped=19.0 2023-06-25 03:47:15,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1249362.0, ans=0.0 2023-06-25 03:47:25,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1249362.0, ans=0.2 2023-06-25 03:48:05,074 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.99 vs. limit=22.5 2023-06-25 03:48:16,086 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.59 vs. limit=15.0 2023-06-25 03:48:32,322 INFO [train.py:996] (2/4) Epoch 7, batch 25300, loss[loss=0.1795, simple_loss=0.2688, pruned_loss=0.04511, over 21766.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2876, pruned_loss=0.0697, over 4256055.51 frames. ], batch size: 351, lr: 4.20e-03, grad_scale: 32.0 2023-06-25 03:48:32,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1249602.0, ans=0.125 2023-06-25 03:48:59,362 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=15.0 2023-06-25 03:49:09,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1249662.0, ans=0.125 2023-06-25 03:49:31,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1249722.0, ans=0.0 2023-06-25 03:49:31,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1249722.0, ans=0.125 2023-06-25 03:50:12,400 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=22.5 2023-06-25 03:50:20,525 INFO [train.py:996] (2/4) Epoch 7, batch 25350, loss[loss=0.181, simple_loss=0.2764, pruned_loss=0.04278, over 21748.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2916, pruned_loss=0.06938, over 4255333.01 frames. 
], batch size: 332, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 03:50:29,461 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.308e+02 2.853e+02 3.365e+02 4.532e+02 7.857e+02, threshold=6.730e+02, percent-clipped=0.0 2023-06-25 03:51:00,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1250022.0, ans=0.04949747468305833 2023-06-25 03:51:02,670 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:51:02,796 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:51:45,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1250142.0, ans=0.1 2023-06-25 03:52:03,107 INFO [train.py:996] (2/4) Epoch 7, batch 25400, loss[loss=0.1749, simple_loss=0.2595, pruned_loss=0.0451, over 21608.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2869, pruned_loss=0.06843, over 4263769.51 frames. ], batch size: 263, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 03:52:10,774 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:52:12,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1250202.0, ans=0.0 2023-06-25 03:52:14,631 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=15.0 2023-06-25 03:53:46,477 INFO [train.py:996] (2/4) Epoch 7, batch 25450, loss[loss=0.2094, simple_loss=0.306, pruned_loss=0.05634, over 21811.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.287, pruned_loss=0.06962, over 4276611.83 frames. ], batch size: 371, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 03:53:54,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1250502.0, ans=0.0 2023-06-25 03:53:55,096 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 2.979e+02 3.775e+02 5.252e+02 7.977e+02, threshold=7.549e+02, percent-clipped=6.0 2023-06-25 03:54:34,489 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.36 vs. limit=15.0 2023-06-25 03:54:42,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1250622.0, ans=0.0 2023-06-25 03:54:47,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1250682.0, ans=0.2 2023-06-25 03:55:32,119 INFO [train.py:996] (2/4) Epoch 7, batch 25500, loss[loss=0.2326, simple_loss=0.3178, pruned_loss=0.07373, over 21754.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2877, pruned_loss=0.0665, over 4265383.77 frames. ], batch size: 351, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 03:56:00,015 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. 
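limit=6.0

The [scaling.py:962] entries compare a per-module whitening metric against a limit, as in the metric=2.13 vs. limit=6.0 record directly above. The sketch below computes one reasonable metric of this kind: it equals 1.0 when the covariance of the activations in each channel group is isotropic (fully "white") and grows as the covariance becomes more anisotropic. The function and the exact formula are assumptions for illustration, not the formula used in icefall's scaling.py.

```python
# Illustrative sketch of a "whiteness" check like the Whitening log lines above.
# The metric is mean(eigenvalue^2) / mean(eigenvalue)^2 of the per-group
# covariance: 1.0 for a covariance proportional to the identity, larger when
# some directions carry much more energy than others.  This is an assumed
# formulation for illustration, not necessarily the one icefall uses.
import torch


def whitening_metric(x: torch.Tensor, num_groups: int) -> float:
    # x: (num_frames, num_channels); channels are split into `num_groups` groups.
    num_frames, num_channels = x.shape
    group_size = num_channels // num_groups
    x = x.reshape(num_frames, num_groups, group_size)
    x = x - x.mean(dim=0, keepdim=True)
    # Per-group covariance: shape (num_groups, group_size, group_size).
    covar = torch.einsum("ngi,ngj->gij", x, x) / num_frames
    # sum(lambda^2) = ||C||_F^2 and sum(lambda) = trace(C), so no eigendecomposition
    # is needed to form mean(lambda^2) / mean(lambda)^2.
    sum_sq = (covar ** 2).sum(dim=(1, 2))
    trace = covar.diagonal(dim1=1, dim2=2).sum(dim=1)
    metric = group_size * sum_sq / (trace ** 2 + 1e-20)
    return metric.mean().item()


x = torch.randn(1000, 256)                              # roughly white activations
print(whitening_metric(x, num_groups=1))                # near 1.0
print(whitening_metric(x @ torch.randn(256, 256), 1))   # larger: less white
```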
2023-06-25 03:56:08,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1250862.0, ans=0.05 2023-06-25 03:56:13,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1250862.0, ans=0.125 2023-06-25 03:56:15,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1250862.0, ans=0.0 2023-06-25 03:56:17,620 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.09 vs. limit=10.0 2023-06-25 03:56:45,461 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.44 vs. limit=15.0 2023-06-25 03:56:46,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1250982.0, ans=0.0 2023-06-25 03:56:46,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1250982.0, ans=0.125 2023-06-25 03:57:27,553 INFO [train.py:996] (2/4) Epoch 7, batch 25550, loss[loss=0.1978, simple_loss=0.2813, pruned_loss=0.05719, over 21202.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2948, pruned_loss=0.06706, over 4268291.57 frames. ], batch size: 159, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 03:57:41,639 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 3.132e+02 4.314e+02 5.832e+02 9.037e+02, threshold=8.627e+02, percent-clipped=4.0 2023-06-25 03:58:33,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1251282.0, ans=0.125 2023-06-25 03:58:46,797 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.59 vs. limit=22.5 2023-06-25 03:59:21,934 INFO [train.py:996] (2/4) Epoch 7, batch 25600, loss[loss=0.2504, simple_loss=0.3359, pruned_loss=0.0824, over 21340.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2987, pruned_loss=0.06816, over 4278225.91 frames. ], batch size: 131, lr: 4.20e-03, grad_scale: 32.0 2023-06-25 04:00:25,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1251582.0, ans=0.125 2023-06-25 04:00:56,301 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=15.0 2023-06-25 04:01:04,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1251642.0, ans=0.0 2023-06-25 04:01:09,220 INFO [train.py:996] (2/4) Epoch 7, batch 25650, loss[loss=0.2036, simple_loss=0.27, pruned_loss=0.06856, over 21764.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2987, pruned_loss=0.07037, over 4271841.05 frames.
], batch size: 317, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:01:19,263 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 3.050e+02 3.577e+02 4.545e+02 8.924e+02, threshold=7.154e+02, percent-clipped=2.0 2023-06-25 04:01:23,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1251702.0, ans=0.0 2023-06-25 04:01:26,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1251762.0, ans=0.1 2023-06-25 04:01:40,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1251762.0, ans=0.0 2023-06-25 04:02:23,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1251882.0, ans=0.2 2023-06-25 04:02:40,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1251942.0, ans=0.125 2023-06-25 04:02:54,050 INFO [train.py:996] (2/4) Epoch 7, batch 25700, loss[loss=0.2543, simple_loss=0.3204, pruned_loss=0.09413, over 21579.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2956, pruned_loss=0.0712, over 4269591.06 frames. ], batch size: 471, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:02:59,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1252002.0, ans=0.125 2023-06-25 04:03:02,317 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.69 vs. limit=10.0 2023-06-25 04:03:07,565 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-25 04:03:22,057 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=22.5 2023-06-25 04:04:12,271 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.64 vs. limit=22.5 2023-06-25 04:04:43,982 INFO [train.py:996] (2/4) Epoch 7, batch 25750, loss[loss=0.2616, simple_loss=0.3362, pruned_loss=0.09349, over 21543.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3008, pruned_loss=0.07438, over 4268421.42 frames. ], batch size: 414, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:04:55,424 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.453e+02 3.207e+02 3.828e+02 5.534e+02 9.207e+02, threshold=7.655e+02, percent-clipped=4.0 2023-06-25 04:06:33,695 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.95 vs. limit=15.0 2023-06-25 04:06:41,409 INFO [train.py:996] (2/4) Epoch 7, batch 25800, loss[loss=0.2485, simple_loss=0.3305, pruned_loss=0.0833, over 20710.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3138, pruned_loss=0.07936, over 4263220.50 frames. ], batch size: 607, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:07:40,817 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.20 vs. 
limit=15.0 2023-06-25 04:07:42,389 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.01 vs. limit=22.5 2023-06-25 04:07:44,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1252722.0, ans=0.125 2023-06-25 04:07:45,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1252722.0, ans=0.125 2023-06-25 04:08:08,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1252842.0, ans=0.0 2023-06-25 04:08:36,055 INFO [train.py:996] (2/4) Epoch 7, batch 25850, loss[loss=0.2289, simple_loss=0.2893, pruned_loss=0.08426, over 20146.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.314, pruned_loss=0.07834, over 4269413.11 frames. ], batch size: 702, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:08:36,700 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:08:46,064 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.468e+02 3.799e+02 4.980e+02 7.138e+02 1.041e+03, threshold=9.960e+02, percent-clipped=14.0 2023-06-25 04:09:31,091 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.85 vs. limit=15.0 2023-06-25 04:09:55,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1253142.0, ans=0.5 2023-06-25 04:10:22,241 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.37 vs. limit=12.0 2023-06-25 04:10:24,682 INFO [train.py:996] (2/4) Epoch 7, batch 25900, loss[loss=0.2621, simple_loss=0.3725, pruned_loss=0.07587, over 20938.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3138, pruned_loss=0.07852, over 4276287.33 frames. ], batch size: 607, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:10:51,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1253262.0, ans=0.0 2023-06-25 04:11:06,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1253262.0, ans=0.0 2023-06-25 04:11:32,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1253382.0, ans=0.125 2023-06-25 04:11:35,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1253382.0, ans=0.125 2023-06-25 04:11:45,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1253382.0, ans=0.125 2023-06-25 04:12:09,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1253442.0, ans=0.0 2023-06-25 04:12:19,420 INFO [train.py:996] (2/4) Epoch 7, batch 25950, loss[loss=0.2463, simple_loss=0.3245, pruned_loss=0.08406, over 21674.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3197, pruned_loss=0.08067, over 4268774.36 frames. 
], batch size: 298, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:12:30,258 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 3.924e+02 4.825e+02 6.667e+02 9.345e+02, threshold=9.651e+02, percent-clipped=0.0 2023-06-25 04:12:38,500 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.55 vs. limit=15.0 2023-06-25 04:13:52,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1253742.0, ans=0.125 2023-06-25 04:14:08,571 INFO [train.py:996] (2/4) Epoch 7, batch 26000, loss[loss=0.2321, simple_loss=0.3109, pruned_loss=0.07661, over 21269.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3182, pruned_loss=0.07901, over 4267792.38 frames. ], batch size: 176, lr: 4.20e-03, grad_scale: 32.0 2023-06-25 04:14:34,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1253862.0, ans=0.05 2023-06-25 04:15:14,057 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=12.0 2023-06-25 04:15:58,149 INFO [train.py:996] (2/4) Epoch 7, batch 26050, loss[loss=0.2477, simple_loss=0.3221, pruned_loss=0.08668, over 21896.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3171, pruned_loss=0.08013, over 4277269.57 frames. ], batch size: 124, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:16:04,516 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.46 vs. limit=15.0 2023-06-25 04:16:10,019 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.358e+02 3.188e+02 3.821e+02 5.430e+02 8.574e+02, threshold=7.643e+02, percent-clipped=0.0 2023-06-25 04:16:12,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1254102.0, ans=0.125 2023-06-25 04:16:14,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1254162.0, ans=0.125 2023-06-25 04:16:23,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1254162.0, ans=0.035 2023-06-25 04:16:23,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1254162.0, ans=0.125 2023-06-25 04:16:56,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1254222.0, ans=0.0 2023-06-25 04:17:45,919 INFO [train.py:996] (2/4) Epoch 7, batch 26100, loss[loss=0.2105, simple_loss=0.283, pruned_loss=0.06899, over 21458.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3129, pruned_loss=0.0798, over 4273594.61 frames. ], batch size: 131, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:17:51,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1254402.0, ans=0.125 2023-06-25 04:18:39,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1254522.0, ans=0.125 2023-06-25 04:18:42,939 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.77 vs. 
limit=15.0 2023-06-25 04:19:31,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1254642.0, ans=0.1 2023-06-25 04:19:32,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1254642.0, ans=0.125 2023-06-25 04:19:35,053 INFO [train.py:996] (2/4) Epoch 7, batch 26150, loss[loss=0.23, simple_loss=0.2987, pruned_loss=0.08069, over 20948.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3101, pruned_loss=0.07994, over 4282618.81 frames. ], batch size: 608, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:19:47,504 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.240e+02 3.858e+02 5.306e+02 8.605e+02, threshold=7.716e+02, percent-clipped=2.0 2023-06-25 04:20:02,642 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=15.0 2023-06-25 04:20:31,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1254822.0, ans=0.125 2023-06-25 04:21:03,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1254882.0, ans=0.125 2023-06-25 04:21:06,045 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.05 vs. limit=22.5 2023-06-25 04:21:10,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1254942.0, ans=0.0 2023-06-25 04:21:24,110 INFO [train.py:996] (2/4) Epoch 7, batch 26200, loss[loss=0.223, simple_loss=0.3234, pruned_loss=0.06133, over 21650.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3109, pruned_loss=0.0779, over 4282491.87 frames. ], batch size: 263, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:21:26,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1255002.0, ans=0.2 2023-06-25 04:21:28,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1255002.0, ans=0.0 2023-06-25 04:22:31,945 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:22:48,626 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.05 vs. limit=15.0 2023-06-25 04:23:12,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1255302.0, ans=0.09899494936611666 2023-06-25 04:23:13,410 INFO [train.py:996] (2/4) Epoch 7, batch 26250, loss[loss=0.2021, simple_loss=0.275, pruned_loss=0.06455, over 21472.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3138, pruned_loss=0.07705, over 4282280.60 frames. 
], batch size: 194, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:23:25,313 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 3.172e+02 3.762e+02 4.925e+02 1.309e+03, threshold=7.524e+02, percent-clipped=5.0 2023-06-25 04:24:02,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1255422.0, ans=0.0 2023-06-25 04:25:01,098 INFO [train.py:996] (2/4) Epoch 7, batch 26300, loss[loss=0.2359, simple_loss=0.2994, pruned_loss=0.08618, over 21932.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3108, pruned_loss=0.07714, over 4284451.16 frames. ], batch size: 414, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:25:02,117 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.93 vs. limit=22.5 2023-06-25 04:25:06,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1255602.0, ans=0.1 2023-06-25 04:25:24,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1255662.0, ans=0.0 2023-06-25 04:26:13,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1255782.0, ans=0.125 2023-06-25 04:26:31,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1255842.0, ans=0.125 2023-06-25 04:26:53,859 INFO [train.py:996] (2/4) Epoch 7, batch 26350, loss[loss=0.2563, simple_loss=0.3284, pruned_loss=0.0921, over 21308.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3094, pruned_loss=0.07752, over 4282174.78 frames. ], batch size: 176, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:27:11,531 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.447e+02 3.110e+02 3.681e+02 4.505e+02 7.991e+02, threshold=7.361e+02, percent-clipped=2.0 2023-06-25 04:27:20,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1255962.0, ans=0.1 2023-06-25 04:27:27,880 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=22.5 2023-06-25 04:27:56,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1256022.0, ans=0.125 2023-06-25 04:28:36,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1256142.0, ans=15.0 2023-06-25 04:28:40,471 INFO [train.py:996] (2/4) Epoch 7, batch 26400, loss[loss=0.2068, simple_loss=0.2621, pruned_loss=0.07577, over 21208.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3029, pruned_loss=0.07713, over 4270789.98 frames. ], batch size: 144, lr: 4.19e-03, grad_scale: 32.0 2023-06-25 04:29:01,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1256202.0, ans=0.015 2023-06-25 04:29:07,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1256262.0, ans=0.125 2023-06-25 04:29:36,938 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. 
limit=6.0 2023-06-25 04:29:39,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=1256322.0, ans=0.2 2023-06-25 04:30:39,803 INFO [train.py:996] (2/4) Epoch 7, batch 26450, loss[loss=0.2428, simple_loss=0.3721, pruned_loss=0.05674, over 20767.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3038, pruned_loss=0.0767, over 4263023.07 frames. ], batch size: 607, lr: 4.19e-03, grad_scale: 32.0 2023-06-25 04:30:57,250 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.534e+02 4.471e+02 5.534e+02 1.801e+03, threshold=8.941e+02, percent-clipped=10.0 2023-06-25 04:31:29,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1256622.0, ans=0.0 2023-06-25 04:32:35,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1256802.0, ans=0.125 2023-06-25 04:32:36,251 INFO [train.py:996] (2/4) Epoch 7, batch 26500, loss[loss=0.1696, simple_loss=0.232, pruned_loss=0.0536, over 21793.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3051, pruned_loss=0.07578, over 4261712.57 frames. ], batch size: 102, lr: 4.19e-03, grad_scale: 32.0 2023-06-25 04:32:36,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1256802.0, ans=0.1 2023-06-25 04:32:40,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1256802.0, ans=0.2 2023-06-25 04:33:04,767 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.80 vs. limit=15.0 2023-06-25 04:33:17,785 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.06 vs. limit=8.0 2023-06-25 04:34:33,097 INFO [train.py:996] (2/4) Epoch 7, batch 26550, loss[loss=0.1972, simple_loss=0.2901, pruned_loss=0.05217, over 21736.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3018, pruned_loss=0.07347, over 4261812.99 frames. ], batch size: 332, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:34:47,491 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 3.332e+02 4.391e+02 7.235e+02 1.419e+03, threshold=8.782e+02, percent-clipped=20.0 2023-06-25 04:34:58,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1257162.0, ans=0.2 2023-06-25 04:35:04,811 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.91 vs. limit=6.0 2023-06-25 04:35:48,399 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=15.0 2023-06-25 04:36:03,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1257342.0, ans=0.0 2023-06-25 04:36:16,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1257342.0, ans=0.125 2023-06-25 04:36:21,185 INFO [train.py:996] (2/4) Epoch 7, batch 26600, loss[loss=0.2015, simple_loss=0.2777, pruned_loss=0.06265, over 21588.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3018, pruned_loss=0.0706, over 4257771.35 frames. 
], batch size: 263, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:36:23,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1257402.0, ans=0.1 2023-06-25 04:36:41,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1257402.0, ans=0.125 2023-06-25 04:36:42,050 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.31 vs. limit=15.0 2023-06-25 04:36:50,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1257462.0, ans=0.125 2023-06-25 04:37:29,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1257582.0, ans=0.0 2023-06-25 04:38:10,080 INFO [train.py:996] (2/4) Epoch 7, batch 26650, loss[loss=0.165, simple_loss=0.2563, pruned_loss=0.03684, over 21803.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2947, pruned_loss=0.06957, over 4256053.68 frames. ], batch size: 352, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:38:28,646 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.895e+02 3.400e+02 5.153e+02 1.068e+03, threshold=6.799e+02, percent-clipped=4.0 2023-06-25 04:39:22,864 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:39:26,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1257882.0, ans=0.125 2023-06-25 04:39:47,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1257942.0, ans=0.0 2023-06-25 04:39:54,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1257942.0, ans=0.125 2023-06-25 04:39:54,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1257942.0, ans=0.0 2023-06-25 04:39:57,607 INFO [train.py:996] (2/4) Epoch 7, batch 26700, loss[loss=0.1918, simple_loss=0.2649, pruned_loss=0.05934, over 21791.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2886, pruned_loss=0.0673, over 4256164.41 frames. ], batch size: 247, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:40:55,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1258122.0, ans=0.125 2023-06-25 04:40:55,822 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=12.0 2023-06-25 04:41:19,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1258182.0, ans=0.035 2023-06-25 04:41:32,735 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=22.5 2023-06-25 04:41:33,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1258242.0, ans=0.125 2023-06-25 04:41:52,592 INFO [train.py:996] (2/4) Epoch 7, batch 26750, loss[loss=0.219, simple_loss=0.3032, pruned_loss=0.06739, over 21677.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2882, pruned_loss=0.06638, over 4257446.78 frames. 
], batch size: 298, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:42:06,353 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.716e+02 3.514e+02 4.569e+02 1.217e+03, threshold=7.028e+02, percent-clipped=8.0 2023-06-25 04:42:08,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1258362.0, ans=0.07 2023-06-25 04:42:11,658 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.04 vs. limit=15.0 2023-06-25 04:42:27,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1258362.0, ans=0.125 2023-06-25 04:42:36,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1258362.0, ans=0.125 2023-06-25 04:42:43,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1258422.0, ans=0.0 2023-06-25 04:42:49,039 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.02 vs. limit=22.5 2023-06-25 04:43:43,451 INFO [train.py:996] (2/4) Epoch 7, batch 26800, loss[loss=0.2934, simple_loss=0.3565, pruned_loss=0.1151, over 21445.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2954, pruned_loss=0.0705, over 4261965.61 frames. ], batch size: 471, lr: 4.19e-03, grad_scale: 32.0 2023-06-25 04:43:49,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1258602.0, ans=0.125 2023-06-25 04:43:54,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1258602.0, ans=0.04949747468305833 2023-06-25 04:44:17,385 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-25 04:45:12,455 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.94 vs. limit=6.0 2023-06-25 04:45:32,698 INFO [train.py:996] (2/4) Epoch 7, batch 26850, loss[loss=0.2039, simple_loss=0.261, pruned_loss=0.0734, over 21249.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2962, pruned_loss=0.07246, over 4267261.31 frames. ], batch size: 159, lr: 4.19e-03, grad_scale: 32.0 2023-06-25 04:45:50,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1258902.0, ans=0.09899494936611666 2023-06-25 04:45:58,736 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.727e+02 3.580e+02 4.511e+02 5.580e+02 1.314e+03, threshold=9.022e+02, percent-clipped=13.0 2023-06-25 04:47:22,403 INFO [train.py:996] (2/4) Epoch 7, batch 26900, loss[loss=0.1807, simple_loss=0.2453, pruned_loss=0.0581, over 21637.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2883, pruned_loss=0.07176, over 4263669.24 frames. 
], batch size: 298, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:47:46,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1259262.0, ans=15.0 2023-06-25 04:49:05,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1259502.0, ans=0.5 2023-06-25 04:49:06,720 INFO [train.py:996] (2/4) Epoch 7, batch 26950, loss[loss=0.2483, simple_loss=0.3312, pruned_loss=0.08273, over 21604.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2885, pruned_loss=0.07228, over 4270747.11 frames. ], batch size: 414, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:49:12,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1259502.0, ans=0.125 2023-06-25 04:49:33,611 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.559e+02 3.020e+02 3.484e+02 4.294e+02 8.554e+02, threshold=6.967e+02, percent-clipped=0.0 2023-06-25 04:49:49,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1259562.0, ans=0.125 2023-06-25 04:49:53,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1259622.0, ans=0.0 2023-06-25 04:50:50,034 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-06-25 04:51:02,051 INFO [train.py:996] (2/4) Epoch 7, batch 27000, loss[loss=0.1893, simple_loss=0.2768, pruned_loss=0.05092, over 21625.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.291, pruned_loss=0.07093, over 4276743.99 frames. ], batch size: 263, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:51:02,052 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 04:51:17,841 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.4.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.6925, 2.3093, 3.7003, 2.3605], device='cuda:2') 2023-06-25 04:51:24,280 INFO [train.py:1028] (2/4) Epoch 7, validation: loss=0.2512, simple_loss=0.3463, pruned_loss=0.07806, over 1796401.00 frames. 2023-06-25 04:51:24,281 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-25 04:52:02,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1259862.0, ans=0.125 2023-06-25 04:52:23,250 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.31 vs. limit=6.0 2023-06-25 04:52:52,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1260042.0, ans=0.1 2023-06-25 04:53:08,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1260042.0, ans=0.1 2023-06-25 04:53:14,907 INFO [train.py:996] (2/4) Epoch 7, batch 27050, loss[loss=0.2174, simple_loss=0.3004, pruned_loss=0.06722, over 21867.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2939, pruned_loss=0.06794, over 4283910.33 frames. 
], batch size: 316, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:53:34,735 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.897e+02 3.762e+02 4.771e+02 8.226e+02, threshold=7.524e+02, percent-clipped=2.0 2023-06-25 04:54:25,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1260282.0, ans=0.0 2023-06-25 04:54:27,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1260282.0, ans=0.125 2023-06-25 04:54:35,985 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-25 04:55:04,185 INFO [train.py:996] (2/4) Epoch 7, batch 27100, loss[loss=0.2266, simple_loss=0.3128, pruned_loss=0.07022, over 21228.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2965, pruned_loss=0.06919, over 4285345.19 frames. ], batch size: 159, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:55:04,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1260402.0, ans=0.0 2023-06-25 04:55:05,370 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=12.0 2023-06-25 04:55:05,515 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.17 vs. limit=15.0 2023-06-25 04:55:35,474 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.24 vs. limit=15.0 2023-06-25 04:56:08,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1260582.0, ans=0.125 2023-06-25 04:56:53,968 INFO [train.py:996] (2/4) Epoch 7, batch 27150, loss[loss=0.2251, simple_loss=0.3174, pruned_loss=0.06636, over 21263.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3083, pruned_loss=0.07282, over 4284831.18 frames. ], batch size: 176, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:57:19,850 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.400e+02 4.098e+02 5.830e+02 1.178e+03, threshold=8.196e+02, percent-clipped=9.0 2023-06-25 04:57:31,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1260762.0, ans=0.125 2023-06-25 04:57:38,082 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:57:52,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1260822.0, ans=0.07 2023-06-25 04:58:10,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1260882.0, ans=0.2 2023-06-25 04:58:44,430 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:58:53,809 INFO [train.py:996] (2/4) Epoch 7, batch 27200, loss[loss=0.2702, simple_loss=0.353, pruned_loss=0.09368, over 21748.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3165, pruned_loss=0.07562, over 4285365.90 frames. 
], batch size: 351, lr: 4.19e-03, grad_scale: 32.0 2023-06-25 04:59:08,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1261002.0, ans=0.04949747468305833 2023-06-25 04:59:22,066 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.93 vs. limit=6.0 2023-06-25 04:59:42,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1261122.0, ans=0.0 2023-06-25 04:59:46,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1261122.0, ans=0.05 2023-06-25 04:59:50,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1261122.0, ans=10.0 2023-06-25 05:00:01,131 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.45 vs. limit=6.0 2023-06-25 05:00:02,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1261182.0, ans=0.1 2023-06-25 05:00:44,392 INFO [train.py:996] (2/4) Epoch 7, batch 27250, loss[loss=0.2594, simple_loss=0.3327, pruned_loss=0.09304, over 21328.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3177, pruned_loss=0.07843, over 4279139.36 frames. ], batch size: 159, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:01:01,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1261362.0, ans=0.04949747468305833 2023-06-25 05:01:02,614 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.541e+02 3.239e+02 3.756e+02 4.583e+02 7.251e+02, threshold=7.513e+02, percent-clipped=0.0 2023-06-25 05:01:23,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1261362.0, ans=0.07 2023-06-25 05:01:23,618 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0 2023-06-25 05:01:37,788 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.78 vs. limit=15.0 2023-06-25 05:02:36,164 INFO [train.py:996] (2/4) Epoch 7, batch 27300, loss[loss=0.2336, simple_loss=0.3293, pruned_loss=0.06901, over 21912.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3181, pruned_loss=0.07955, over 4269375.96 frames. ], batch size: 372, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:03:20,995 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.28 vs. limit=15.0 2023-06-25 05:04:18,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1261842.0, ans=0.0 2023-06-25 05:04:26,405 INFO [train.py:996] (2/4) Epoch 7, batch 27350, loss[loss=0.2379, simple_loss=0.3238, pruned_loss=0.07597, over 21811.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3206, pruned_loss=0.08076, over 4269965.72 frames. 
], batch size: 118, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:04:48,249 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.470e+02 4.790e+02 5.992e+02 9.415e+02, threshold=9.580e+02, percent-clipped=9.0 2023-06-25 05:05:08,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1261962.0, ans=0.1 2023-06-25 05:06:18,602 INFO [train.py:996] (2/4) Epoch 7, batch 27400, loss[loss=0.207, simple_loss=0.2774, pruned_loss=0.06831, over 21782.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3172, pruned_loss=0.08004, over 4274513.74 frames. ], batch size: 351, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:06:29,078 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-25 05:06:50,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1262262.0, ans=0.125 2023-06-25 05:07:39,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1262382.0, ans=0.125 2023-06-25 05:08:08,415 INFO [train.py:996] (2/4) Epoch 7, batch 27450, loss[loss=0.2063, simple_loss=0.2853, pruned_loss=0.06361, over 21317.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3116, pruned_loss=0.07845, over 4268823.63 frames. ], batch size: 211, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:08:36,545 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.458e+02 3.140e+02 3.820e+02 5.353e+02 9.307e+02, threshold=7.640e+02, percent-clipped=0.0 2023-06-25 05:09:35,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1262742.0, ans=0.0 2023-06-25 05:09:50,501 INFO [train.py:996] (2/4) Epoch 7, batch 27500, loss[loss=0.2095, simple_loss=0.2857, pruned_loss=0.06669, over 21269.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3096, pruned_loss=0.07844, over 4275638.93 frames. ], batch size: 176, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:10:06,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1262802.0, ans=0.2 2023-06-25 05:11:37,914 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=22.5 2023-06-25 05:11:43,529 INFO [train.py:996] (2/4) Epoch 7, batch 27550, loss[loss=0.1932, simple_loss=0.2636, pruned_loss=0.06142, over 21754.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3033, pruned_loss=0.07506, over 4279271.79 frames. 
], batch size: 124, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:12:10,964 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.562e+02 3.311e+02 4.001e+02 4.826e+02 1.149e+03, threshold=8.002e+02, percent-clipped=4.0 2023-06-25 05:12:15,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1263162.0, ans=0.125 2023-06-25 05:12:36,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1263222.0, ans=0.125 2023-06-25 05:12:56,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1263282.0, ans=0.1 2023-06-25 05:13:29,726 INFO [train.py:996] (2/4) Epoch 7, batch 27600, loss[loss=0.2551, simple_loss=0.3147, pruned_loss=0.09773, over 14840.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2961, pruned_loss=0.0737, over 4277352.98 frames. ], batch size: 60, lr: 4.18e-03, grad_scale: 32.0 2023-06-25 05:14:09,139 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:14:53,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1263642.0, ans=0.125 2023-06-25 05:15:10,404 INFO [train.py:996] (2/4) Epoch 7, batch 27650, loss[loss=0.2057, simple_loss=0.2852, pruned_loss=0.06313, over 21858.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2905, pruned_loss=0.07322, over 4270232.34 frames. ], batch size: 316, lr: 4.18e-03, grad_scale: 32.0 2023-06-25 05:15:37,188 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.286e+02 3.109e+02 3.684e+02 5.059e+02 1.214e+03, threshold=7.368e+02, percent-clipped=6.0 2023-06-25 05:15:58,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1263822.0, ans=0.125 2023-06-25 05:16:40,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1263942.0, ans=0.2 2023-06-25 05:16:54,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1263942.0, ans=0.125 2023-06-25 05:16:57,728 INFO [train.py:996] (2/4) Epoch 7, batch 27700, loss[loss=0.2256, simple_loss=0.2996, pruned_loss=0.07581, over 21320.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2904, pruned_loss=0.07122, over 4276212.08 frames. ], batch size: 176, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:17:21,520 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.87 vs. 
limit=22.5 2023-06-25 05:17:25,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1264062.0, ans=0.125 2023-06-25 05:18:00,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1264122.0, ans=0.5 2023-06-25 05:18:10,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1264182.0, ans=0.2 2023-06-25 05:18:36,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1264242.0, ans=0.1 2023-06-25 05:18:49,463 INFO [train.py:996] (2/4) Epoch 7, batch 27750, loss[loss=0.1925, simple_loss=0.2879, pruned_loss=0.04852, over 20911.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2933, pruned_loss=0.07061, over 4278666.29 frames. ], batch size: 608, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:19:19,127 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.393e+02 2.962e+02 3.488e+02 4.454e+02 9.416e+02, threshold=6.976e+02, percent-clipped=4.0 2023-06-25 05:19:28,854 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.63 vs. limit=22.5 2023-06-25 05:19:40,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1264422.0, ans=0.125 2023-06-25 05:19:46,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1264422.0, ans=0.2 2023-06-25 05:20:01,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1264482.0, ans=0.1 2023-06-25 05:20:36,082 INFO [train.py:996] (2/4) Epoch 7, batch 27800, loss[loss=0.2144, simple_loss=0.2853, pruned_loss=0.07176, over 21926.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2922, pruned_loss=0.07096, over 4278375.08 frames. ], batch size: 316, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:20:44,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1264602.0, ans=0.2 2023-06-25 05:20:52,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1264602.0, ans=0.125 2023-06-25 05:20:57,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1264662.0, ans=0.0 2023-06-25 05:21:16,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1264662.0, ans=0.1 2023-06-25 05:21:44,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1264782.0, ans=0.1 2023-06-25 05:22:24,425 INFO [train.py:996] (2/4) Epoch 7, batch 27850, loss[loss=0.2086, simple_loss=0.2726, pruned_loss=0.07232, over 21164.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2917, pruned_loss=0.07275, over 4284432.61 frames. 
], batch size: 608, lr: 4.18e-03, grad_scale: 8.0 2023-06-25 05:22:57,347 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.414e+02 3.134e+02 3.811e+02 5.096e+02 8.843e+02, threshold=7.621e+02, percent-clipped=7.0 2023-06-25 05:23:02,207 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.64 vs. limit=10.0 2023-06-25 05:23:10,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1264962.0, ans=0.125 2023-06-25 05:23:23,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1265022.0, ans=0.125 2023-06-25 05:24:27,037 INFO [train.py:996] (2/4) Epoch 7, batch 27900, loss[loss=0.2294, simple_loss=0.3086, pruned_loss=0.07504, over 21821.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3028, pruned_loss=0.0748, over 4291296.65 frames. ], batch size: 112, lr: 4.18e-03, grad_scale: 8.0 2023-06-25 05:24:47,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1265262.0, ans=0.0 2023-06-25 05:26:21,605 INFO [train.py:996] (2/4) Epoch 7, batch 27950, loss[loss=0.2364, simple_loss=0.3282, pruned_loss=0.07232, over 21684.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3031, pruned_loss=0.0716, over 4283223.99 frames. ], batch size: 441, lr: 4.18e-03, grad_scale: 8.0 2023-06-25 05:26:32,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1265502.0, ans=0.125 2023-06-25 05:26:42,605 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 3.117e+02 4.053e+02 5.979e+02 1.114e+03, threshold=8.107e+02, percent-clipped=11.0 2023-06-25 05:28:09,599 INFO [train.py:996] (2/4) Epoch 7, batch 28000, loss[loss=0.1802, simple_loss=0.2588, pruned_loss=0.05082, over 21699.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.3008, pruned_loss=0.06939, over 4282391.75 frames. ], batch size: 263, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:28:16,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1265802.0, ans=0.0 2023-06-25 05:28:28,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1265862.0, ans=0.125 2023-06-25 05:28:28,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1265862.0, ans=0.2 2023-06-25 05:28:37,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1265862.0, ans=0.0 2023-06-25 05:28:45,934 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=12.0 2023-06-25 05:28:47,393 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. 
limit=6.0 2023-06-25 05:28:53,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1265922.0, ans=0.125 2023-06-25 05:29:07,814 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:30:01,619 INFO [train.py:996] (2/4) Epoch 7, batch 28050, loss[loss=0.2368, simple_loss=0.3331, pruned_loss=0.07024, over 21246.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2981, pruned_loss=0.07053, over 4279640.61 frames. ], batch size: 548, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:30:22,383 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.952e+02 3.818e+02 5.160e+02 1.220e+03, threshold=7.636e+02, percent-clipped=4.0 2023-06-25 05:30:39,269 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.41 vs. limit=12.0 2023-06-25 05:31:02,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1266282.0, ans=0.125 2023-06-25 05:31:26,908 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.70 vs. limit=22.5 2023-06-25 05:31:27,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1266342.0, ans=0.0 2023-06-25 05:31:51,608 INFO [train.py:996] (2/4) Epoch 7, batch 28100, loss[loss=0.1868, simple_loss=0.2431, pruned_loss=0.06525, over 20790.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2958, pruned_loss=0.06998, over 4271972.38 frames. ], batch size: 608, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:31:52,868 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-25 05:32:22,427 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.08 vs. limit=15.0 2023-06-25 05:33:04,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1266582.0, ans=0.0 2023-06-25 05:33:25,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1266642.0, ans=0.125 2023-06-25 05:33:25,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1266642.0, ans=0.125 2023-06-25 05:33:40,495 INFO [train.py:996] (2/4) Epoch 7, batch 28150, loss[loss=0.1719, simple_loss=0.2186, pruned_loss=0.06254, over 20792.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2876, pruned_loss=0.06986, over 4265562.56 frames. 
], batch size: 609, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:34:01,759 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.438e+02 3.369e+02 4.176e+02 5.786e+02 1.041e+03, threshold=8.353e+02, percent-clipped=8.0 2023-06-25 05:34:07,726 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:34:38,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1266822.0, ans=0.0 2023-06-25 05:35:29,310 INFO [train.py:996] (2/4) Epoch 7, batch 28200, loss[loss=0.2253, simple_loss=0.2862, pruned_loss=0.08219, over 21844.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2871, pruned_loss=0.07097, over 4255289.29 frames. ], batch size: 98, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:37:06,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1267242.0, ans=0.0 2023-06-25 05:37:17,992 INFO [train.py:996] (2/4) Epoch 7, batch 28250, loss[loss=0.2339, simple_loss=0.2937, pruned_loss=0.087, over 21571.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2921, pruned_loss=0.07399, over 4251732.32 frames. ], batch size: 441, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:37:18,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1267302.0, ans=0.0 2023-06-25 05:37:43,657 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.555e+02 3.449e+02 4.309e+02 5.866e+02 1.082e+03, threshold=8.618e+02, percent-clipped=6.0 2023-06-25 05:37:44,903 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-25 05:37:57,585 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.91 vs. limit=15.0 2023-06-25 05:38:25,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1267482.0, ans=0.125 2023-06-25 05:38:32,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1267482.0, ans=0.0 2023-06-25 05:38:43,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1267482.0, ans=0.0 2023-06-25 05:38:43,767 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.80 vs. limit=12.0 2023-06-25 05:39:01,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1267542.0, ans=0.125 2023-06-25 05:39:08,742 INFO [train.py:996] (2/4) Epoch 7, batch 28300, loss[loss=0.1981, simple_loss=0.2902, pruned_loss=0.053, over 21785.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2897, pruned_loss=0.07218, over 4253075.39 frames. 
], batch size: 371, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:40:25,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1267782.0, ans=0.2 2023-06-25 05:40:32,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1267782.0, ans=0.125 2023-06-25 05:40:45,632 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.83 vs. limit=10.0 2023-06-25 05:40:48,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1267842.0, ans=0.125 2023-06-25 05:41:03,573 INFO [train.py:996] (2/4) Epoch 7, batch 28350, loss[loss=0.1908, simple_loss=0.26, pruned_loss=0.06077, over 21674.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2864, pruned_loss=0.06658, over 4258809.70 frames. ], batch size: 282, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:41:29,703 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.944e+02 2.753e+02 3.449e+02 4.988e+02 1.144e+03, threshold=6.899e+02, percent-clipped=4.0 2023-06-25 05:41:50,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1268022.0, ans=0.1 2023-06-25 05:42:09,376 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.81 vs. limit=15.0 2023-06-25 05:42:12,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1268082.0, ans=0.125 2023-06-25 05:42:16,080 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.95 vs. limit=12.0 2023-06-25 05:42:51,344 INFO [train.py:996] (2/4) Epoch 7, batch 28400, loss[loss=0.2192, simple_loss=0.2985, pruned_loss=0.06992, over 21703.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2833, pruned_loss=0.06571, over 4258612.71 frames. ], batch size: 351, lr: 4.17e-03, grad_scale: 32.0 2023-06-25 05:42:55,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1268202.0, ans=0.1 2023-06-25 05:43:15,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1268262.0, ans=0.2 2023-06-25 05:44:35,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1268442.0, ans=0.1 2023-06-25 05:44:42,027 INFO [train.py:996] (2/4) Epoch 7, batch 28450, loss[loss=0.2498, simple_loss=0.3189, pruned_loss=0.09031, over 21830.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2883, pruned_loss=0.06957, over 4258221.14 frames. 
], batch size: 414, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:45:15,057 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.500e+02 3.249e+02 3.944e+02 5.811e+02 1.668e+03, threshold=7.889e+02, percent-clipped=19.0 2023-06-25 05:45:29,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1268622.0, ans=0.125 2023-06-25 05:45:52,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1268682.0, ans=0.0 2023-06-25 05:46:20,935 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=15.0 2023-06-25 05:46:34,022 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.77 vs. limit=12.0 2023-06-25 05:46:36,231 INFO [train.py:996] (2/4) Epoch 7, batch 28500, loss[loss=0.2064, simple_loss=0.2706, pruned_loss=0.07108, over 21199.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2913, pruned_loss=0.07259, over 4270789.18 frames. ], batch size: 608, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:47:29,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1268922.0, ans=0.0 2023-06-25 05:48:02,173 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.32 vs. limit=15.0 2023-06-25 05:48:31,035 INFO [train.py:996] (2/4) Epoch 7, batch 28550, loss[loss=0.2401, simple_loss=0.3329, pruned_loss=0.07368, over 21444.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2995, pruned_loss=0.07502, over 4270541.60 frames. ], batch size: 211, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:48:53,559 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.539e+02 3.516e+02 4.419e+02 5.883e+02 1.246e+03, threshold=8.838e+02, percent-clipped=8.0 2023-06-25 05:49:17,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1269222.0, ans=0.1 2023-06-25 05:49:18,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1269222.0, ans=0.0 2023-06-25 05:49:31,627 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.82 vs. limit=15.0 2023-06-25 05:50:18,708 INFO [train.py:996] (2/4) Epoch 7, batch 28600, loss[loss=0.3263, simple_loss=0.3712, pruned_loss=0.1407, over 21416.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3058, pruned_loss=0.07761, over 4271099.67 frames. 
], batch size: 509, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:50:30,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1269402.0, ans=0.125 2023-06-25 05:50:51,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1269462.0, ans=0.125 2023-06-25 05:51:02,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1269522.0, ans=0.0 2023-06-25 05:51:03,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=1269522.0, ans=0.02 2023-06-25 05:51:18,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1269522.0, ans=0.0 2023-06-25 05:51:23,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1269582.0, ans=0.0 2023-06-25 05:51:23,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1269582.0, ans=0.0 2023-06-25 05:51:24,060 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.95 vs. limit=22.5 2023-06-25 05:52:07,670 INFO [train.py:996] (2/4) Epoch 7, batch 28650, loss[loss=0.2055, simple_loss=0.2701, pruned_loss=0.07043, over 21996.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3002, pruned_loss=0.0767, over 4272597.25 frames. ], batch size: 375, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:52:11,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1269702.0, ans=0.125 2023-06-25 05:52:14,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1269702.0, ans=0.125 2023-06-25 05:52:30,247 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.450e+02 3.536e+02 4.575e+02 6.589e+02 8.896e+02, threshold=9.150e+02, percent-clipped=1.0 2023-06-25 05:52:30,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1269762.0, ans=0.1 2023-06-25 05:52:36,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1269762.0, ans=0.125 2023-06-25 05:52:57,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1269822.0, ans=0.0 2023-06-25 05:53:55,690 INFO [train.py:996] (2/4) Epoch 7, batch 28700, loss[loss=0.2139, simple_loss=0.2816, pruned_loss=0.07308, over 20721.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.2997, pruned_loss=0.07707, over 4278008.70 frames. 
], batch size: 607, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:54:06,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1270002.0, ans=0.07 2023-06-25 05:54:14,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1270062.0, ans=0.125 2023-06-25 05:54:23,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1270062.0, ans=0.125 2023-06-25 05:55:04,550 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=22.5 2023-06-25 05:55:25,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1270242.0, ans=0.0 2023-06-25 05:55:29,283 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=12.0 2023-06-25 05:55:43,752 INFO [train.py:996] (2/4) Epoch 7, batch 28750, loss[loss=0.2182, simple_loss=0.3046, pruned_loss=0.06587, over 21784.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2994, pruned_loss=0.07737, over 4286091.43 frames. ], batch size: 414, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:56:06,409 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.645e+02 3.238e+02 3.725e+02 5.020e+02 9.578e+02, threshold=7.449e+02, percent-clipped=2.0 2023-06-25 05:57:09,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1270482.0, ans=0.125 2023-06-25 05:57:33,231 INFO [train.py:996] (2/4) Epoch 7, batch 28800, loss[loss=0.2984, simple_loss=0.3593, pruned_loss=0.1187, over 21451.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3018, pruned_loss=0.07781, over 4282221.94 frames. ], batch size: 471, lr: 4.17e-03, grad_scale: 32.0 2023-06-25 05:57:47,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1270602.0, ans=0.0 2023-06-25 05:57:58,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1270662.0, ans=0.04949747468305833 2023-06-25 05:58:54,091 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.41 vs. limit=22.5 2023-06-25 05:58:55,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1270782.0, ans=0.125 2023-06-25 05:59:15,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1270842.0, ans=0.125 2023-06-25 05:59:22,083 INFO [train.py:996] (2/4) Epoch 7, batch 28850, loss[loss=0.2239, simple_loss=0.2911, pruned_loss=0.0783, over 21311.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.302, pruned_loss=0.07857, over 4284671.84 frames. 
], batch size: 159, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 06:00:02,823 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.622e+02 3.393e+02 4.119e+02 6.059e+02 1.112e+03, threshold=8.239e+02, percent-clipped=12.0 2023-06-25 06:00:28,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1271022.0, ans=0.125 2023-06-25 06:00:32,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=1271022.0, ans=0.2 2023-06-25 06:01:17,978 INFO [train.py:996] (2/4) Epoch 7, batch 28900, loss[loss=0.2305, simple_loss=0.3007, pruned_loss=0.08013, over 21362.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3074, pruned_loss=0.08076, over 4284317.33 frames. ], batch size: 159, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 06:01:28,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1271202.0, ans=0.1 2023-06-25 06:01:51,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1271262.0, ans=0.0 2023-06-25 06:03:09,347 INFO [train.py:996] (2/4) Epoch 7, batch 28950, loss[loss=0.2358, simple_loss=0.3368, pruned_loss=0.06741, over 21690.00 frames. ], tot_loss[loss=0.235, simple_loss=0.309, pruned_loss=0.0805, over 4280930.66 frames. ], batch size: 414, lr: 4.17e-03, grad_scale: 8.0 2023-06-25 06:03:13,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1271502.0, ans=0.07 2023-06-25 06:03:46,133 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.351e+02 3.609e+02 4.387e+02 5.987e+02 1.071e+03, threshold=8.774e+02, percent-clipped=6.0 2023-06-25 06:04:14,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1271682.0, ans=0.0 2023-06-25 06:04:46,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1271742.0, ans=0.0 2023-06-25 06:05:02,834 INFO [train.py:996] (2/4) Epoch 7, batch 29000, loss[loss=0.245, simple_loss=0.3056, pruned_loss=0.09223, over 20149.00 frames. ], tot_loss[loss=0.236, simple_loss=0.312, pruned_loss=0.07999, over 4278266.06 frames. ], batch size: 707, lr: 4.17e-03, grad_scale: 8.0 2023-06-25 06:05:05,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1271802.0, ans=0.05 2023-06-25 06:05:58,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1271922.0, ans=0.09899494936611666 2023-06-25 06:06:47,478 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.65 vs. limit=15.0 2023-06-25 06:06:51,574 INFO [train.py:996] (2/4) Epoch 7, batch 29050, loss[loss=0.2433, simple_loss=0.3131, pruned_loss=0.08671, over 21872.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3117, pruned_loss=0.08164, over 4288467.54 frames. 
], batch size: 107, lr: 4.17e-03, grad_scale: 8.0 2023-06-25 06:07:21,615 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.493e+02 3.635e+02 4.186e+02 5.307e+02 1.029e+03, threshold=8.372e+02, percent-clipped=1.0 2023-06-25 06:07:36,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1272222.0, ans=0.07 2023-06-25 06:08:03,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1272282.0, ans=0.0 2023-06-25 06:08:37,195 INFO [train.py:996] (2/4) Epoch 7, batch 29100, loss[loss=0.1979, simple_loss=0.2613, pruned_loss=0.06722, over 21782.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3029, pruned_loss=0.07891, over 4292622.66 frames. ], batch size: 124, lr: 4.17e-03, grad_scale: 8.0 2023-06-25 06:09:00,226 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=15.0 2023-06-25 06:09:11,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1272462.0, ans=0.125 2023-06-25 06:09:11,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1272462.0, ans=0.125 2023-06-25 06:09:30,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1272522.0, ans=0.1 2023-06-25 06:09:49,780 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:09:55,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1272642.0, ans=0.125 2023-06-25 06:10:23,691 INFO [train.py:996] (2/4) Epoch 7, batch 29150, loss[loss=0.2528, simple_loss=0.3262, pruned_loss=0.08974, over 21518.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3027, pruned_loss=0.07722, over 4275724.83 frames. ], batch size: 389, lr: 4.17e-03, grad_scale: 8.0 2023-06-25 06:10:54,198 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 3.210e+02 4.222e+02 5.476e+02 9.873e+02, threshold=8.444e+02, percent-clipped=1.0 2023-06-25 06:10:54,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1272762.0, ans=0.1 2023-06-25 06:11:15,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1272822.0, ans=0.0 2023-06-25 06:12:10,648 INFO [train.py:996] (2/4) Epoch 7, batch 29200, loss[loss=0.2183, simple_loss=0.2721, pruned_loss=0.08219, over 21506.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2977, pruned_loss=0.07645, over 4269942.51 frames. ], batch size: 441, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 06:12:43,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1273062.0, ans=0.125 2023-06-25 06:14:03,025 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=22.5 2023-06-25 06:14:05,293 INFO [train.py:996] (2/4) Epoch 7, batch 29250, loss[loss=0.2532, simple_loss=0.3438, pruned_loss=0.08136, over 21620.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2955, pruned_loss=0.07391, over 4263114.57 frames. 
], batch size: 442, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:14:05,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1273302.0, ans=0.125 2023-06-25 06:14:31,560 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.162e+02 4.067e+02 5.479e+02 1.081e+03, threshold=8.134e+02, percent-clipped=3.0 2023-06-25 06:14:51,788 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:14:53,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1273422.0, ans=0.125 2023-06-25 06:15:53,723 INFO [train.py:996] (2/4) Epoch 7, batch 29300, loss[loss=0.2106, simple_loss=0.284, pruned_loss=0.0686, over 21488.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2975, pruned_loss=0.07304, over 4264975.66 frames. ], batch size: 389, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:16:48,662 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.20 vs. limit=15.0 2023-06-25 06:16:54,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1273782.0, ans=0.125 2023-06-25 06:17:10,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1273782.0, ans=0.2 2023-06-25 06:17:26,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1273842.0, ans=0.1 2023-06-25 06:17:29,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1273842.0, ans=0.0 2023-06-25 06:17:42,121 INFO [train.py:996] (2/4) Epoch 7, batch 29350, loss[loss=0.179, simple_loss=0.2449, pruned_loss=0.05649, over 21574.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2924, pruned_loss=0.07167, over 4252937.17 frames. ], batch size: 263, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:17:44,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1273902.0, ans=0.125 2023-06-25 06:18:13,786 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.302e+02 3.026e+02 3.822e+02 5.352e+02 1.093e+03, threshold=7.644e+02, percent-clipped=3.0 2023-06-25 06:19:30,173 INFO [train.py:996] (2/4) Epoch 7, batch 29400, loss[loss=0.2061, simple_loss=0.3138, pruned_loss=0.0492, over 20782.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2905, pruned_loss=0.06963, over 4249964.12 frames. 
], batch size: 609, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:20:48,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1274382.0, ans=0.0 2023-06-25 06:20:52,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1274382.0, ans=0.2 2023-06-25 06:20:54,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1274382.0, ans=0.125 2023-06-25 06:20:56,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1274382.0, ans=0.0 2023-06-25 06:21:18,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1274502.0, ans=0.125 2023-06-25 06:21:19,307 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=22.5 2023-06-25 06:21:20,150 INFO [train.py:996] (2/4) Epoch 7, batch 29450, loss[loss=0.2447, simple_loss=0.3257, pruned_loss=0.08186, over 21907.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2869, pruned_loss=0.06761, over 4251173.28 frames. ], batch size: 372, lr: 4.16e-03, grad_scale: 8.0 2023-06-25 06:21:22,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1274502.0, ans=0.125 2023-06-25 06:21:22,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1274502.0, ans=0.0 2023-06-25 06:21:53,725 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.339e+02 3.532e+02 4.385e+02 5.559e+02 1.410e+03, threshold=8.770e+02, percent-clipped=9.0 2023-06-25 06:22:05,991 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=22.5 2023-06-25 06:22:34,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1274682.0, ans=0.2 2023-06-25 06:22:44,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1274682.0, ans=0.1 2023-06-25 06:23:08,504 INFO [train.py:996] (2/4) Epoch 7, batch 29500, loss[loss=0.2255, simple_loss=0.3342, pruned_loss=0.05838, over 19758.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2934, pruned_loss=0.07116, over 4259642.56 frames. ], batch size: 703, lr: 4.16e-03, grad_scale: 8.0 2023-06-25 06:23:15,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1274802.0, ans=0.2 2023-06-25 06:24:39,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1275042.0, ans=0.0 2023-06-25 06:24:56,260 INFO [train.py:996] (2/4) Epoch 7, batch 29550, loss[loss=0.2255, simple_loss=0.2811, pruned_loss=0.08499, over 21602.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2922, pruned_loss=0.07271, over 4267790.90 frames. ], batch size: 548, lr: 4.16e-03, grad_scale: 8.0 2023-06-25 06:25:10,201 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.99 vs. 
limit=15.0 2023-06-25 06:25:11,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1275102.0, ans=0.2 2023-06-25 06:25:25,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1275162.0, ans=0.125 2023-06-25 06:25:30,056 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.666e+02 3.932e+02 4.748e+02 5.685e+02 9.373e+02, threshold=9.495e+02, percent-clipped=3.0 2023-06-25 06:26:45,582 INFO [train.py:996] (2/4) Epoch 7, batch 29600, loss[loss=0.2429, simple_loss=0.3308, pruned_loss=0.0775, over 21788.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.299, pruned_loss=0.07501, over 4275244.05 frames. ], batch size: 298, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:27:00,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1275402.0, ans=0.125 2023-06-25 06:27:04,158 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-06-25 06:27:21,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1275462.0, ans=0.0 2023-06-25 06:27:29,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1275462.0, ans=0.0 2023-06-25 06:27:58,056 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.66 vs. limit=10.0 2023-06-25 06:28:14,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1275642.0, ans=0.0 2023-06-25 06:28:26,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1275642.0, ans=0.125 2023-06-25 06:28:31,755 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:28:33,295 INFO [train.py:996] (2/4) Epoch 7, batch 29650, loss[loss=0.1814, simple_loss=0.2545, pruned_loss=0.05417, over 21676.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.298, pruned_loss=0.07238, over 4280026.96 frames. ], batch size: 230, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:28:57,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1275762.0, ans=0.125 2023-06-25 06:29:16,850 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.458e+02 4.326e+02 5.325e+02 1.074e+03, threshold=8.651e+02, percent-clipped=3.0 2023-06-25 06:29:39,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1275822.0, ans=0.125 2023-06-25 06:30:26,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1276002.0, ans=15.0 2023-06-25 06:30:27,062 INFO [train.py:996] (2/4) Epoch 7, batch 29700, loss[loss=0.2499, simple_loss=0.3565, pruned_loss=0.07165, over 21792.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.299, pruned_loss=0.07283, over 4287836.37 frames. 
], batch size: 282, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:31:51,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1276182.0, ans=0.1 2023-06-25 06:32:16,257 INFO [train.py:996] (2/4) Epoch 7, batch 29750, loss[loss=0.2364, simple_loss=0.3274, pruned_loss=0.07268, over 21727.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3049, pruned_loss=0.07254, over 4282213.30 frames. ], batch size: 441, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:32:54,079 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.514e+02 3.299e+02 3.896e+02 4.722e+02 1.232e+03, threshold=7.792e+02, percent-clipped=5.0 2023-06-25 06:32:54,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1276362.0, ans=0.1 2023-06-25 06:33:22,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1276482.0, ans=0.125 2023-06-25 06:33:22,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1276482.0, ans=0.1 2023-06-25 06:34:03,398 INFO [train.py:996] (2/4) Epoch 7, batch 29800, loss[loss=0.2196, simple_loss=0.2953, pruned_loss=0.07196, over 21227.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3055, pruned_loss=0.07279, over 4286132.42 frames. ], batch size: 143, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:35:24,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1276782.0, ans=0.0 2023-06-25 06:35:35,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1276842.0, ans=0.2 2023-06-25 06:35:50,572 INFO [train.py:996] (2/4) Epoch 7, batch 29850, loss[loss=0.1731, simple_loss=0.2543, pruned_loss=0.04599, over 15934.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3012, pruned_loss=0.07186, over 4276276.77 frames. ], batch size: 60, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:36:00,064 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.73 vs. limit=15.0 2023-06-25 06:36:28,324 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.285e+02 2.948e+02 3.373e+02 4.045e+02 7.832e+02, threshold=6.745e+02, percent-clipped=1.0 2023-06-25 06:37:12,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1277142.0, ans=0.07 2023-06-25 06:37:36,841 INFO [train.py:996] (2/4) Epoch 7, batch 29900, loss[loss=0.2408, simple_loss=0.3053, pruned_loss=0.08817, over 21846.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2993, pruned_loss=0.07273, over 4288281.25 frames. 
], batch size: 247, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:37:37,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1277202.0, ans=0.0 2023-06-25 06:38:08,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1277262.0, ans=0.1 2023-06-25 06:38:36,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1277322.0, ans=10.0 2023-06-25 06:39:25,230 INFO [train.py:996] (2/4) Epoch 7, batch 29950, loss[loss=0.2737, simple_loss=0.3432, pruned_loss=0.1022, over 21275.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3039, pruned_loss=0.07684, over 4288024.59 frames. ], batch size: 159, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:39:49,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1277502.0, ans=0.1 2023-06-25 06:39:58,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1277562.0, ans=0.0 2023-06-25 06:40:08,701 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.749e+02 3.319e+02 4.450e+02 5.387e+02 9.920e+02, threshold=8.899e+02, percent-clipped=12.0 2023-06-25 06:40:10,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1277562.0, ans=0.2 2023-06-25 06:40:37,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1277682.0, ans=0.125 2023-06-25 06:40:38,000 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-25 06:40:38,123 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-25 06:40:50,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1277682.0, ans=0.1 2023-06-25 06:41:19,266 INFO [train.py:996] (2/4) Epoch 7, batch 30000, loss[loss=0.2077, simple_loss=0.3069, pruned_loss=0.05428, over 21662.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.305, pruned_loss=0.07646, over 4286655.49 frames. ], batch size: 414, lr: 4.16e-03, grad_scale: 32.0 2023-06-25 06:41:19,266 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 06:41:33,800 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.9339, 2.4036, 2.4891, 2.7680, 2.3620, 2.3893, 2.7753, 2.7149], device='cuda:2') 2023-06-25 06:41:39,223 INFO [train.py:1028] (2/4) Epoch 7, validation: loss=0.2493, simple_loss=0.346, pruned_loss=0.07628, over 1796401.00 frames. 
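[Editorial note: the "tot_loss[... over N frames.]" figures printed in the entries above are frame-weighted running averages of the per-frame losses accumulated across batches. The following is a minimal illustrative sketch of that bookkeeping only; the class and method names (LossTracker, update, average) are hypothetical and are not icefall's actual MetricsTracker API.]

from collections import defaultdict


class LossTracker:
    """Keeps frame-weighted sums so the averages match the log's convention."""

    def __init__(self) -> None:
        self.sums = defaultdict(float)  # frame-weighted sum per loss name
        self.frames = 0.0               # cumulative number of frames seen

    def update(self, num_frames: float, **losses: float) -> None:
        # A batch contributes its per-frame losses weighted by its frame count.
        self.frames += num_frames
        for name, value in losses.items():
            self.sums[name] += value * num_frames

    def average(self) -> dict:
        # Per-frame averages, reported together with self.frames
        # (the "over N frames" part of each tot_loss entry).
        return {name: total / self.frames for name, total in self.sums.items()}


# Example using two batch-level loss entries taken from the log above
# (Epoch 7 batch 28900 and Epoch 7 batch 30000):
tracker = LossTracker()
tracker.update(21362.0, loss=0.2305, simple_loss=0.3007, pruned_loss=0.08013)
tracker.update(21662.0, loss=0.2077, simple_loss=0.3069, pruned_loss=0.05428)
print(tracker.average(), tracker.frames)

[End of editorial note; the log resumes below.]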
2023-06-25 06:41:39,224 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-25 06:41:48,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1277802.0, ans=0.125 2023-06-25 06:42:05,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1277862.0, ans=0.0 2023-06-25 06:42:06,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1277862.0, ans=0.1 2023-06-25 06:42:39,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1277922.0, ans=0.2 2023-06-25 06:43:30,377 INFO [train.py:996] (2/4) Epoch 7, batch 30050, loss[loss=0.2608, simple_loss=0.3825, pruned_loss=0.06958, over 21185.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3089, pruned_loss=0.07429, over 4282640.55 frames. ], batch size: 548, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:43:37,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1278102.0, ans=0.1 2023-06-25 06:43:48,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1278102.0, ans=0.0 2023-06-25 06:43:49,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1278102.0, ans=0.125 2023-06-25 06:43:54,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1278162.0, ans=0.125 2023-06-25 06:44:05,682 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.420e+02 3.279e+02 4.155e+02 5.724e+02 1.149e+03, threshold=8.309e+02, percent-clipped=6.0 2023-06-25 06:44:20,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1278222.0, ans=0.1 2023-06-25 06:44:29,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1278222.0, ans=0.125 2023-06-25 06:45:02,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1278342.0, ans=0.2 2023-06-25 06:45:16,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1278402.0, ans=0.125 2023-06-25 06:45:17,751 INFO [train.py:996] (2/4) Epoch 7, batch 30100, loss[loss=0.2172, simple_loss=0.2738, pruned_loss=0.08031, over 21223.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3085, pruned_loss=0.07419, over 4271734.64 frames. ], batch size: 159, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:46:10,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1278522.0, ans=0.125 2023-06-25 06:46:12,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1278522.0, ans=0.125 2023-06-25 06:46:29,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1278582.0, ans=0.0 2023-06-25 06:46:57,719 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.20 vs. 
limit=15.0 2023-06-25 06:47:10,881 INFO [train.py:996] (2/4) Epoch 7, batch 30150, loss[loss=0.2382, simple_loss=0.3112, pruned_loss=0.08255, over 21795.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3045, pruned_loss=0.07546, over 4266178.67 frames. ], batch size: 333, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:47:34,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1278762.0, ans=15.0 2023-06-25 06:47:47,379 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.588e+02 3.267e+02 3.809e+02 4.984e+02 9.103e+02, threshold=7.618e+02, percent-clipped=3.0 2023-06-25 06:47:48,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1278762.0, ans=0.0 2023-06-25 06:48:56,774 INFO [train.py:996] (2/4) Epoch 7, batch 30200, loss[loss=0.1955, simple_loss=0.2822, pruned_loss=0.05442, over 21421.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3071, pruned_loss=0.07466, over 4268580.69 frames. ], batch size: 211, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:50:09,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1279182.0, ans=0.125 2023-06-25 06:50:49,637 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.56 vs. limit=10.0 2023-06-25 06:50:59,061 INFO [train.py:996] (2/4) Epoch 7, batch 30250, loss[loss=0.2401, simple_loss=0.3232, pruned_loss=0.07844, over 21255.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3151, pruned_loss=0.07781, over 4271451.64 frames. ], batch size: 143, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:51:10,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1279302.0, ans=0.1 2023-06-25 06:51:12,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1279302.0, ans=0.125 2023-06-25 06:51:28,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1279362.0, ans=0.125 2023-06-25 06:51:29,404 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=12.0 2023-06-25 06:51:33,137 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.437e+02 3.334e+02 4.601e+02 6.960e+02 1.343e+03, threshold=9.203e+02, percent-clipped=16.0 2023-06-25 06:51:39,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1279422.0, ans=0.0 2023-06-25 06:52:13,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1279482.0, ans=0.2 2023-06-25 06:52:33,755 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.47 vs. limit=8.0 2023-06-25 06:52:41,304 INFO [train.py:996] (2/4) Epoch 7, batch 30300, loss[loss=0.1892, simple_loss=0.2549, pruned_loss=0.06174, over 21513.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3118, pruned_loss=0.07719, over 4267385.04 frames. 
], batch size: 263, lr: 4.15e-03, grad_scale: 16.0 2023-06-25 06:53:10,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1279662.0, ans=0.0 2023-06-25 06:53:44,413 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:54:03,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1279782.0, ans=0.1 2023-06-25 06:54:27,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1279842.0, ans=0.0 2023-06-25 06:54:37,783 INFO [train.py:996] (2/4) Epoch 7, batch 30350, loss[loss=0.2288, simple_loss=0.3181, pruned_loss=0.06977, over 21816.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3142, pruned_loss=0.07891, over 4274762.82 frames. ], batch size: 352, lr: 4.15e-03, grad_scale: 16.0 2023-06-25 06:55:05,064 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.424e+02 3.772e+02 4.635e+02 6.721e+02 1.384e+03, threshold=9.269e+02, percent-clipped=9.0 2023-06-25 06:55:10,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1280022.0, ans=0.125 2023-06-25 06:55:12,255 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=15.0 2023-06-25 06:55:15,011 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.45 vs. limit=10.0 2023-06-25 06:55:37,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1280082.0, ans=0.1 2023-06-25 06:55:40,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1280142.0, ans=0.0 2023-06-25 06:56:00,348 INFO [train.py:996] (2/4) Epoch 7, batch 30400, loss[loss=0.2177, simple_loss=0.2618, pruned_loss=0.08678, over 20268.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3073, pruned_loss=0.0775, over 4261811.51 frames. ], batch size: 703, lr: 4.15e-03, grad_scale: 32.0 2023-06-25 06:56:15,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1280202.0, ans=0.125 2023-06-25 06:57:12,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1280382.0, ans=0.0 2023-06-25 06:57:33,117 INFO [train.py:996] (2/4) Epoch 7, batch 30450, loss[loss=0.2616, simple_loss=0.3784, pruned_loss=0.07239, over 19877.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3077, pruned_loss=0.07709, over 4202104.13 frames. 
], batch size: 702, lr: 4.15e-03, grad_scale: 32.0 2023-06-25 06:58:02,610 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.728e+02 6.501e+02 9.013e+02 1.486e+03 3.895e+03, threshold=1.803e+03, percent-clipped=46.0 2023-06-25 06:58:12,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1280622.0, ans=0.2 2023-06-25 06:58:18,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1280622.0, ans=0.0 2023-06-25 06:58:38,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1280742.0, ans=10.0 2023-06-25 07:01:02,074 INFO [train.py:996] (2/4) Epoch 8, batch 0, loss[loss=0.2134, simple_loss=0.2853, pruned_loss=0.0708, over 21594.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2853, pruned_loss=0.0708, over 21594.00 frames. ], batch size: 298, lr: 3.86e-03, grad_scale: 32.0 2023-06-25 07:01:02,075 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 07:01:19,570 INFO [train.py:1028] (2/4) Epoch 8, validation: loss=0.2406, simple_loss=0.3467, pruned_loss=0.06724, over 1796401.00 frames. 2023-06-25 07:01:19,571 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-25 07:01:20,651 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=22.5 2023-06-25 07:02:41,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1281012.0, ans=0.0 2023-06-25 07:03:05,851 INFO [train.py:996] (2/4) Epoch 8, batch 50, loss[loss=0.2951, simple_loss=0.3713, pruned_loss=0.1094, over 21479.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3139, pruned_loss=0.07417, over 972386.79 frames. ], batch size: 471, lr: 3.86e-03, grad_scale: 32.0 2023-06-25 07:03:13,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1281072.0, ans=0.0 2023-06-25 07:03:31,390 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:03:49,764 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.736e+02 3.478e+02 5.204e+02 1.094e+03 2.896e+03, threshold=1.041e+03, percent-clipped=7.0 2023-06-25 07:04:28,739 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=22.5 2023-06-25 07:04:51,359 INFO [train.py:996] (2/4) Epoch 8, batch 100, loss[loss=0.2442, simple_loss=0.3506, pruned_loss=0.06888, over 21257.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3288, pruned_loss=0.07765, over 1701285.24 frames. ], batch size: 176, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:05:09,084 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=15.0 2023-06-25 07:05:25,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1281432.0, ans=0.2 2023-06-25 07:06:37,757 INFO [train.py:996] (2/4) Epoch 8, batch 150, loss[loss=0.2303, simple_loss=0.3022, pruned_loss=0.07916, over 21876.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3315, pruned_loss=0.07776, over 2273954.27 frames. 
], batch size: 118, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:07:03,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1281732.0, ans=0.09899494936611666 2023-06-25 07:07:13,013 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-25 07:07:27,453 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.475e+02 3.041e+02 3.436e+02 4.359e+02 9.068e+02, threshold=6.872e+02, percent-clipped=0.0 2023-06-25 07:07:44,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1281852.0, ans=0.0 2023-06-25 07:08:04,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1281912.0, ans=0.0 2023-06-25 07:08:18,562 INFO [train.py:996] (2/4) Epoch 8, batch 200, loss[loss=0.2365, simple_loss=0.3511, pruned_loss=0.06096, over 20736.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3232, pruned_loss=0.07506, over 2718984.31 frames. ], batch size: 607, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:09:33,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1282152.0, ans=0.125 2023-06-25 07:09:59,999 INFO [train.py:996] (2/4) Epoch 8, batch 250, loss[loss=0.2252, simple_loss=0.2968, pruned_loss=0.07676, over 21799.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3187, pruned_loss=0.07535, over 3069743.35 frames. ], batch size: 298, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:10:24,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1282332.0, ans=0.125 2023-06-25 07:10:43,260 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.02 vs. limit=15.0 2023-06-25 07:10:45,282 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.611e+02 3.498e+02 4.445e+02 5.647e+02 1.101e+03, threshold=8.891e+02, percent-clipped=14.0 2023-06-25 07:11:23,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1282452.0, ans=0.125 2023-06-25 07:11:33,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1282512.0, ans=0.0 2023-06-25 07:11:49,074 INFO [train.py:996] (2/4) Epoch 8, batch 300, loss[loss=0.2118, simple_loss=0.3212, pruned_loss=0.05125, over 19826.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3104, pruned_loss=0.07398, over 3332820.83 frames. ], batch size: 703, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:12:03,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1282572.0, ans=0.125 2023-06-25 07:13:21,340 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-25 07:13:39,781 INFO [train.py:996] (2/4) Epoch 8, batch 350, loss[loss=0.1932, simple_loss=0.2579, pruned_loss=0.0643, over 21348.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3035, pruned_loss=0.07335, over 3537761.38 frames. 
], batch size: 160, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:14:30,125 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.131e+02 3.897e+02 5.934e+02 1.239e+03, threshold=7.794e+02, percent-clipped=5.0 2023-06-25 07:15:14,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1283112.0, ans=0.125 2023-06-25 07:15:16,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1283112.0, ans=0.1 2023-06-25 07:15:24,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1283112.0, ans=0.0 2023-06-25 07:15:27,518 INFO [train.py:996] (2/4) Epoch 8, batch 400, loss[loss=0.2172, simple_loss=0.3318, pruned_loss=0.05135, over 20856.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2987, pruned_loss=0.07307, over 3695441.38 frames. ], batch size: 608, lr: 3.86e-03, grad_scale: 32.0 2023-06-25 07:15:52,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1283232.0, ans=0.1 2023-06-25 07:16:46,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1283352.0, ans=0.2 2023-06-25 07:16:56,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1283352.0, ans=0.0 2023-06-25 07:17:11,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1283412.0, ans=0.125 2023-06-25 07:17:19,213 INFO [train.py:996] (2/4) Epoch 8, batch 450, loss[loss=0.1734, simple_loss=0.2388, pruned_loss=0.05399, over 21128.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2965, pruned_loss=0.07097, over 3826731.22 frames. ], batch size: 176, lr: 3.86e-03, grad_scale: 32.0 2023-06-25 07:17:21,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1283472.0, ans=0.1 2023-06-25 07:17:56,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1283532.0, ans=0.2 2023-06-25 07:18:16,643 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.309e+02 3.535e+02 4.359e+02 5.649e+02 1.208e+03, threshold=8.718e+02, percent-clipped=9.0 2023-06-25 07:19:01,980 INFO [train.py:996] (2/4) Epoch 8, batch 500, loss[loss=0.194, simple_loss=0.291, pruned_loss=0.04851, over 21276.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.298, pruned_loss=0.07013, over 3914297.17 frames. ], batch size: 131, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:19:06,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1283772.0, ans=0.125 2023-06-25 07:19:12,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1283772.0, ans=0.125 2023-06-25 07:20:14,287 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.27 vs. 
limit=15.0 2023-06-25 07:20:48,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1284072.0, ans=0.1 2023-06-25 07:20:49,174 INFO [train.py:996] (2/4) Epoch 8, batch 550, loss[loss=0.2024, simple_loss=0.2636, pruned_loss=0.07059, over 21309.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2994, pruned_loss=0.06977, over 4001350.58 frames. ], batch size: 144, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:20:50,473 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.08 vs. limit=15.0 2023-06-25 07:21:11,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1284132.0, ans=0.125 2023-06-25 07:21:45,845 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.415e+02 3.578e+02 5.101e+02 7.574e+02 1.639e+03, threshold=1.020e+03, percent-clipped=17.0 2023-06-25 07:21:49,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1284192.0, ans=0.125 2023-06-25 07:21:53,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1284192.0, ans=0.0 2023-06-25 07:22:01,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1284252.0, ans=0.04949747468305833 2023-06-25 07:22:28,424 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.51 vs. limit=10.0 2023-06-25 07:22:28,849 INFO [train.py:996] (2/4) Epoch 8, batch 600, loss[loss=0.1888, simple_loss=0.2425, pruned_loss=0.06758, over 20000.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3003, pruned_loss=0.07028, over 4063413.84 frames. ], batch size: 704, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:23:05,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1284432.0, ans=0.0 2023-06-25 07:23:24,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1284492.0, ans=0.125 2023-06-25 07:23:35,742 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.94 vs. limit=10.0 2023-06-25 07:23:41,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1284552.0, ans=0.125 2023-06-25 07:24:14,801 INFO [train.py:996] (2/4) Epoch 8, batch 650, loss[loss=0.2136, simple_loss=0.2777, pruned_loss=0.07479, over 21538.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3024, pruned_loss=0.07084, over 4120814.20 frames. ], batch size: 230, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:24:39,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1284732.0, ans=0.125 2023-06-25 07:25:01,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1284792.0, ans=0.1 2023-06-25 07:25:04,120 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.46 vs. 
limit=15.0 2023-06-25 07:25:16,314 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.375e+02 3.313e+02 4.571e+02 7.176e+02 1.629e+03, threshold=9.143e+02, percent-clipped=10.0 2023-06-25 07:25:25,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1284852.0, ans=0.0 2023-06-25 07:25:39,295 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=12.0 2023-06-25 07:26:00,086 INFO [train.py:996] (2/4) Epoch 8, batch 700, loss[loss=0.2118, simple_loss=0.2922, pruned_loss=0.06568, over 21768.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3046, pruned_loss=0.07096, over 4152273.14 frames. ], batch size: 351, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:26:09,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1284972.0, ans=0.2 2023-06-25 07:26:25,446 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=15.0 2023-06-25 07:27:26,499 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-25 07:27:34,922 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-25 07:27:44,314 INFO [train.py:996] (2/4) Epoch 8, batch 750, loss[loss=0.2205, simple_loss=0.2828, pruned_loss=0.07906, over 21890.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3027, pruned_loss=0.07139, over 4187697.81 frames. ], batch size: 98, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:28:16,821 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. limit=6.0 2023-06-25 07:28:47,156 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.536e+02 3.601e+02 4.438e+02 5.764e+02 1.140e+03, threshold=8.877e+02, percent-clipped=3.0 2023-06-25 07:29:17,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1285512.0, ans=0.0 2023-06-25 07:29:32,236 INFO [train.py:996] (2/4) Epoch 8, batch 800, loss[loss=0.1871, simple_loss=0.2557, pruned_loss=0.05927, over 21777.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2991, pruned_loss=0.07146, over 4203316.89 frames. ], batch size: 351, lr: 3.85e-03, grad_scale: 32.0 2023-06-25 07:29:45,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1285572.0, ans=0.1 2023-06-25 07:30:39,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1285692.0, ans=0.0 2023-06-25 07:31:25,109 INFO [train.py:996] (2/4) Epoch 8, batch 850, loss[loss=0.2054, simple_loss=0.2766, pruned_loss=0.06714, over 21517.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2971, pruned_loss=0.07172, over 4224299.33 frames. 
], batch size: 212, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:31:29,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1285872.0, ans=0.125 2023-06-25 07:31:38,050 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.87 vs. limit=10.0 2023-06-25 07:31:44,883 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=15.0 2023-06-25 07:32:23,992 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.482e+02 3.234e+02 3.833e+02 4.866e+02 9.722e+02, threshold=7.666e+02, percent-clipped=1.0 2023-06-25 07:32:58,242 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.78 vs. limit=22.5 2023-06-25 07:33:12,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.whiten.whitening_limit, batch_count=1286172.0, ans=12.0 2023-06-25 07:33:13,051 INFO [train.py:996] (2/4) Epoch 8, batch 900, loss[loss=0.2027, simple_loss=0.2734, pruned_loss=0.06598, over 21491.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2939, pruned_loss=0.07111, over 4237774.54 frames. ], batch size: 194, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:33:29,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1286232.0, ans=0.0 2023-06-25 07:34:14,402 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.22 vs. limit=22.5 2023-06-25 07:35:01,271 INFO [train.py:996] (2/4) Epoch 8, batch 950, loss[loss=0.2224, simple_loss=0.2945, pruned_loss=0.07512, over 21713.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2916, pruned_loss=0.07072, over 4244046.16 frames. ], batch size: 230, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:35:09,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1286472.0, ans=0.1 2023-06-25 07:35:41,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1286532.0, ans=0.0 2023-06-25 07:35:54,407 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.907e+02 3.602e+02 4.628e+02 6.707e+02 1.446e+03, threshold=9.256e+02, percent-clipped=20.0 2023-06-25 07:36:17,702 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=22.5 2023-06-25 07:36:37,124 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=22.5 2023-06-25 07:36:42,690 INFO [train.py:996] (2/4) Epoch 8, batch 1000, loss[loss=0.2344, simple_loss=0.3059, pruned_loss=0.08141, over 21674.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2923, pruned_loss=0.07099, over 4254403.81 frames. 
], batch size: 230, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:36:47,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1286772.0, ans=0.125 2023-06-25 07:38:22,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1287012.0, ans=6.0 2023-06-25 07:38:31,296 INFO [train.py:996] (2/4) Epoch 8, batch 1050, loss[loss=0.1739, simple_loss=0.276, pruned_loss=0.03594, over 21759.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2922, pruned_loss=0.07085, over 4267316.91 frames. ], batch size: 332, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:38:32,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1287072.0, ans=0.2 2023-06-25 07:39:30,744 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.534e+02 3.439e+02 4.407e+02 5.715e+02 1.308e+03, threshold=8.815e+02, percent-clipped=4.0 2023-06-25 07:39:33,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1287192.0, ans=0.2 2023-06-25 07:40:19,277 INFO [train.py:996] (2/4) Epoch 8, batch 1100, loss[loss=0.2276, simple_loss=0.3092, pruned_loss=0.07297, over 21472.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2914, pruned_loss=0.07022, over 4263050.05 frames. ], batch size: 194, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:40:19,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1287372.0, ans=0.0 2023-06-25 07:41:03,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1287492.0, ans=0.0 2023-06-25 07:41:23,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1287552.0, ans=0.0 2023-06-25 07:41:48,952 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=15.0 2023-06-25 07:42:15,460 INFO [train.py:996] (2/4) Epoch 8, batch 1150, loss[loss=0.2455, simple_loss=0.3043, pruned_loss=0.09329, over 21823.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2917, pruned_loss=0.0705, over 4267587.35 frames. ], batch size: 441, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:42:33,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1287732.0, ans=0.0 2023-06-25 07:42:40,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1287732.0, ans=0.0 2023-06-25 07:42:42,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1287732.0, ans=0.125 2023-06-25 07:42:49,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1287792.0, ans=0.0 2023-06-25 07:42:59,622 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.638e+02 3.529e+02 4.325e+02 5.726e+02 1.140e+03, threshold=8.649e+02, percent-clipped=5.0 2023-06-25 07:43:10,197 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.18 vs. 
limit=15.0 2023-06-25 07:43:21,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1287852.0, ans=0.125 2023-06-25 07:43:41,481 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-25 07:43:53,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1287912.0, ans=0.1 2023-06-25 07:43:56,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1287912.0, ans=0.125 2023-06-25 07:43:58,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1287972.0, ans=0.125 2023-06-25 07:43:58,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1287972.0, ans=0.125 2023-06-25 07:43:59,497 INFO [train.py:996] (2/4) Epoch 8, batch 1200, loss[loss=0.2018, simple_loss=0.2924, pruned_loss=0.05561, over 21830.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2933, pruned_loss=0.07048, over 4273000.54 frames. ], batch size: 282, lr: 3.85e-03, grad_scale: 32.0 2023-06-25 07:44:11,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1287972.0, ans=0.1 2023-06-25 07:44:21,943 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.15 vs. limit=15.0 2023-06-25 07:44:26,972 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-25 07:44:41,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1288092.0, ans=0.09899494936611666 2023-06-25 07:45:10,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1288152.0, ans=10.0 2023-06-25 07:45:23,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1288212.0, ans=0.125 2023-06-25 07:45:47,938 INFO [train.py:996] (2/4) Epoch 8, batch 1250, loss[loss=0.2197, simple_loss=0.2967, pruned_loss=0.07138, over 21858.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2957, pruned_loss=0.07134, over 4282891.05 frames. 
], batch size: 351, lr: 3.85e-03, grad_scale: 32.0 2023-06-25 07:45:59,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1288272.0, ans=0.5 2023-06-25 07:46:04,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1288332.0, ans=0.125 2023-06-25 07:46:19,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1288332.0, ans=0.2 2023-06-25 07:46:22,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1288392.0, ans=0.0 2023-06-25 07:46:25,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1288392.0, ans=0.0 2023-06-25 07:46:38,023 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.488e+02 3.316e+02 4.127e+02 5.335e+02 1.234e+03, threshold=8.255e+02, percent-clipped=5.0 2023-06-25 07:47:29,515 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=15.0 2023-06-25 07:47:36,809 INFO [train.py:996] (2/4) Epoch 8, batch 1300, loss[loss=0.2349, simple_loss=0.2977, pruned_loss=0.08606, over 21340.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.298, pruned_loss=0.07223, over 4287303.52 frames. ], batch size: 159, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:47:41,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1288572.0, ans=0.0 2023-06-25 07:47:53,551 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.41 vs. limit=15.0 2023-06-25 07:47:54,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1288632.0, ans=0.125 2023-06-25 07:47:56,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1288632.0, ans=0.2 2023-06-25 07:48:40,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1288752.0, ans=0.125 2023-06-25 07:48:55,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1288812.0, ans=0.0 2023-06-25 07:49:25,902 INFO [train.py:996] (2/4) Epoch 8, batch 1350, loss[loss=0.3022, simple_loss=0.3594, pruned_loss=0.1226, over 21382.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2992, pruned_loss=0.07287, over 4287704.47 frames. 
], batch size: 509, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:50:11,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1288992.0, ans=0.125 2023-06-25 07:50:15,511 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.606e+02 3.456e+02 4.378e+02 5.897e+02 1.151e+03, threshold=8.757e+02, percent-clipped=2.0 2023-06-25 07:50:24,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1289052.0, ans=0.1 2023-06-25 07:51:03,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1289112.0, ans=0.0 2023-06-25 07:51:08,375 INFO [train.py:996] (2/4) Epoch 8, batch 1400, loss[loss=0.1961, simple_loss=0.2638, pruned_loss=0.06424, over 21643.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2967, pruned_loss=0.07204, over 4279215.08 frames. ], batch size: 298, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:51:12,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1289172.0, ans=0.0 2023-06-25 07:51:22,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1289172.0, ans=0.04949747468305833 2023-06-25 07:51:42,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1289292.0, ans=0.1 2023-06-25 07:51:55,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1289292.0, ans=0.0 2023-06-25 07:51:59,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1289292.0, ans=0.125 2023-06-25 07:52:33,880 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:52:33,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1289412.0, ans=0.1 2023-06-25 07:52:57,305 INFO [train.py:996] (2/4) Epoch 8, batch 1450, loss[loss=0.2014, simple_loss=0.2625, pruned_loss=0.07012, over 21467.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2977, pruned_loss=0.0732, over 4285054.04 frames. ], batch size: 195, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:53:05,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1289472.0, ans=0.125 2023-06-25 07:53:48,352 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.564e+02 3.442e+02 4.414e+02 6.258e+02 1.881e+03, threshold=8.827e+02, percent-clipped=13.0 2023-06-25 07:54:26,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1289712.0, ans=0.1 2023-06-25 07:54:44,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1289712.0, ans=0.0 2023-06-25 07:54:47,192 INFO [train.py:996] (2/4) Epoch 8, batch 1500, loss[loss=0.2026, simple_loss=0.2858, pruned_loss=0.05976, over 21337.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2991, pruned_loss=0.07405, over 4289326.85 frames. 
], batch size: 176, lr: 3.85e-03, grad_scale: 8.0 2023-06-25 07:55:11,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1289832.0, ans=0.0 2023-06-25 07:55:36,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1289892.0, ans=0.1 2023-06-25 07:55:36,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1289892.0, ans=0.125 2023-06-25 07:56:02,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1289952.0, ans=0.125 2023-06-25 07:56:26,412 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:56:40,544 INFO [train.py:996] (2/4) Epoch 8, batch 1550, loss[loss=0.1761, simple_loss=0.2472, pruned_loss=0.05251, over 21804.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2968, pruned_loss=0.07248, over 4283527.93 frames. ], batch size: 124, lr: 3.85e-03, grad_scale: 8.0 2023-06-25 07:57:20,127 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.36 vs. limit=15.0 2023-06-25 07:57:35,131 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.389e+02 3.681e+02 5.239e+02 6.621e+02 1.108e+03, threshold=1.048e+03, percent-clipped=5.0 2023-06-25 07:58:14,224 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:58:33,628 INFO [train.py:996] (2/4) Epoch 8, batch 1600, loss[loss=0.1702, simple_loss=0.2222, pruned_loss=0.05908, over 16246.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2977, pruned_loss=0.07291, over 4280160.13 frames. ], batch size: 62, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:58:46,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1290372.0, ans=0.2 2023-06-25 07:59:45,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1290492.0, ans=0.125 2023-06-25 07:59:51,076 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.70 vs. limit=15.0 2023-06-25 07:59:55,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1290552.0, ans=0.125 2023-06-25 08:00:26,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1290672.0, ans=0.1 2023-06-25 08:00:26,994 INFO [train.py:996] (2/4) Epoch 8, batch 1650, loss[loss=0.2514, simple_loss=0.3226, pruned_loss=0.09008, over 21247.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2995, pruned_loss=0.07327, over 4282373.06 frames. 
], batch size: 176, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 08:01:38,155 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.521e+02 3.337e+02 4.261e+02 5.571e+02 1.006e+03, threshold=8.522e+02, percent-clipped=0.0 2023-06-25 08:01:56,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1290852.0, ans=0.125 2023-06-25 08:02:20,389 INFO [train.py:996] (2/4) Epoch 8, batch 1700, loss[loss=0.2245, simple_loss=0.32, pruned_loss=0.06454, over 21726.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3024, pruned_loss=0.07404, over 4280254.41 frames. ], batch size: 298, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:03:16,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1291032.0, ans=0.0 2023-06-25 08:03:51,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1291152.0, ans=0.125 2023-06-25 08:03:59,407 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.07 vs. limit=5.0 2023-06-25 08:04:20,221 INFO [train.py:996] (2/4) Epoch 8, batch 1750, loss[loss=0.1821, simple_loss=0.278, pruned_loss=0.04314, over 21229.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2996, pruned_loss=0.07191, over 4281837.22 frames. ], batch size: 548, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:05:00,901 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.01 vs. limit=15.0 2023-06-25 08:05:17,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1291392.0, ans=0.125 2023-06-25 08:05:26,789 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 3.271e+02 4.291e+02 6.912e+02 1.295e+03, threshold=8.582e+02, percent-clipped=12.0 2023-06-25 08:05:55,103 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.24 vs. limit=10.0 2023-06-25 08:06:19,524 INFO [train.py:996] (2/4) Epoch 8, batch 1800, loss[loss=0.2198, simple_loss=0.3071, pruned_loss=0.06621, over 19942.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2994, pruned_loss=0.07086, over 4279954.89 frames. ], batch size: 702, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:06:20,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1291572.0, ans=0.05 2023-06-25 08:07:14,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1291692.0, ans=0.1 2023-06-25 08:07:37,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1291752.0, ans=0.125 2023-06-25 08:07:43,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1291752.0, ans=0.125 2023-06-25 08:08:10,396 INFO [train.py:996] (2/4) Epoch 8, batch 1850, loss[loss=0.192, simple_loss=0.2826, pruned_loss=0.05068, over 21370.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2985, pruned_loss=0.06902, over 4285108.83 frames. 
], batch size: 194, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:08:29,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1291872.0, ans=0.035 2023-06-25 08:08:29,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1291872.0, ans=0.125 2023-06-25 08:08:58,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1291992.0, ans=0.0 2023-06-25 08:09:02,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1291992.0, ans=0.0 2023-06-25 08:09:08,872 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.511e+02 3.967e+02 5.452e+02 7.986e+02 1.937e+03, threshold=1.090e+03, percent-clipped=22.0 2023-06-25 08:09:25,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1292052.0, ans=0.125 2023-06-25 08:09:29,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1292052.0, ans=0.95 2023-06-25 08:09:38,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1292112.0, ans=0.125 2023-06-25 08:10:05,928 INFO [train.py:996] (2/4) Epoch 8, batch 1900, loss[loss=0.207, simple_loss=0.2841, pruned_loss=0.06496, over 21799.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2971, pruned_loss=0.06965, over 4280875.38 frames. ], batch size: 282, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:10:51,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1292292.0, ans=0.0 2023-06-25 08:11:27,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1292412.0, ans=0.0 2023-06-25 08:11:34,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1292412.0, ans=0.125 2023-06-25 08:11:45,476 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=22.5 2023-06-25 08:12:01,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1292412.0, ans=0.1 2023-06-25 08:12:01,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1292412.0, ans=0.125 2023-06-25 08:12:04,377 INFO [train.py:996] (2/4) Epoch 8, batch 1950, loss[loss=0.2048, simple_loss=0.2744, pruned_loss=0.06761, over 15246.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2946, pruned_loss=0.06977, over 4268083.55 frames. ], batch size: 60, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:12:24,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1292532.0, ans=0.07 2023-06-25 08:12:39,677 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.26 vs. 
limit=15.0 2023-06-25 08:13:00,221 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.618e+02 4.190e+02 5.257e+02 7.093e+02 1.583e+03, threshold=1.051e+03, percent-clipped=6.0 2023-06-25 08:13:00,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1292592.0, ans=0.125 2023-06-25 08:13:16,005 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-06-25 08:13:52,823 INFO [train.py:996] (2/4) Epoch 8, batch 2000, loss[loss=0.1976, simple_loss=0.2602, pruned_loss=0.06745, over 21470.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2915, pruned_loss=0.06808, over 4273949.11 frames. ], batch size: 195, lr: 3.84e-03, grad_scale: 32.0 2023-06-25 08:14:02,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1292772.0, ans=0.125 2023-06-25 08:14:26,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1292892.0, ans=0.125 2023-06-25 08:15:07,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1293012.0, ans=0.0 2023-06-25 08:15:37,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1293012.0, ans=0.125 2023-06-25 08:15:44,214 INFO [train.py:996] (2/4) Epoch 8, batch 2050, loss[loss=0.2525, simple_loss=0.361, pruned_loss=0.07196, over 19864.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2944, pruned_loss=0.06793, over 4279034.54 frames. ], batch size: 702, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:16:07,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1293132.0, ans=0.125 2023-06-25 08:16:13,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1293132.0, ans=0.1 2023-06-25 08:16:39,103 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.662e+02 4.169e+02 5.197e+02 7.491e+02 1.738e+03, threshold=1.039e+03, percent-clipped=10.0 2023-06-25 08:16:55,867 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.81 vs. limit=15.0 2023-06-25 08:17:01,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1293312.0, ans=0.125 2023-06-25 08:17:35,783 INFO [train.py:996] (2/4) Epoch 8, batch 2100, loss[loss=0.195, simple_loss=0.2809, pruned_loss=0.05455, over 21783.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.298, pruned_loss=0.06971, over 4274635.98 frames. ], batch size: 351, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:17:41,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1293372.0, ans=0.125 2023-06-25 08:17:50,858 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.77 vs. 
limit=15.0 2023-06-25 08:17:53,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1293432.0, ans=0.1 2023-06-25 08:19:27,057 INFO [train.py:996] (2/4) Epoch 8, batch 2150, loss[loss=0.2379, simple_loss=0.3377, pruned_loss=0.06907, over 21760.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2967, pruned_loss=0.07046, over 4275898.45 frames. ], batch size: 282, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:19:40,512 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.10 vs. limit=10.0 2023-06-25 08:19:51,275 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=22.5 2023-06-25 08:20:23,096 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.526e+02 3.335e+02 3.972e+02 5.687e+02 1.021e+03, threshold=7.943e+02, percent-clipped=0.0 2023-06-25 08:20:24,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1293792.0, ans=0.0 2023-06-25 08:20:38,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1293852.0, ans=0.125 2023-06-25 08:21:19,279 INFO [train.py:996] (2/4) Epoch 8, batch 2200, loss[loss=0.2268, simple_loss=0.3063, pruned_loss=0.07364, over 21888.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2982, pruned_loss=0.06978, over 4276305.77 frames. ], batch size: 351, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:21:25,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1293972.0, ans=0.125 2023-06-25 08:21:39,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1294032.0, ans=0.0 2023-06-25 08:21:45,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1294032.0, ans=0.125 2023-06-25 08:21:51,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1294032.0, ans=0.125 2023-06-25 08:22:36,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1294152.0, ans=0.0 2023-06-25 08:23:00,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1294212.0, ans=0.125 2023-06-25 08:23:08,639 INFO [train.py:996] (2/4) Epoch 8, batch 2250, loss[loss=0.215, simple_loss=0.3052, pruned_loss=0.06244, over 21440.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2946, pruned_loss=0.0689, over 4273422.56 frames. 
], batch size: 211, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:23:58,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1294392.0, ans=0.125 2023-06-25 08:24:02,843 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.466e+02 3.638e+02 4.452e+02 6.050e+02 1.629e+03, threshold=8.904e+02, percent-clipped=11.0 2023-06-25 08:24:03,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1294392.0, ans=0.025 2023-06-25 08:24:11,553 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.22 vs. limit=15.0 2023-06-25 08:24:50,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1294512.0, ans=0.05 2023-06-25 08:24:52,823 INFO [train.py:996] (2/4) Epoch 8, batch 2300, loss[loss=0.2248, simple_loss=0.3249, pruned_loss=0.06232, over 21640.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2899, pruned_loss=0.06891, over 4265525.33 frames. ], batch size: 263, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:25:02,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1294572.0, ans=0.0 2023-06-25 08:25:10,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1294632.0, ans=0.125 2023-06-25 08:25:37,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1294692.0, ans=0.2 2023-06-25 08:26:25,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1294812.0, ans=0.0 2023-06-25 08:26:46,493 INFO [train.py:996] (2/4) Epoch 8, batch 2350, loss[loss=0.2908, simple_loss=0.3559, pruned_loss=0.1128, over 21768.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2891, pruned_loss=0.06956, over 4273221.91 frames. ], batch size: 441, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:27:41,222 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.567e+02 4.172e+02 5.399e+02 7.196e+02 1.286e+03, threshold=1.080e+03, percent-clipped=11.0 2023-06-25 08:27:59,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1295052.0, ans=0.0 2023-06-25 08:28:37,753 INFO [train.py:996] (2/4) Epoch 8, batch 2400, loss[loss=0.2137, simple_loss=0.2854, pruned_loss=0.07094, over 21228.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2925, pruned_loss=0.07174, over 4272796.57 frames. ], batch size: 548, lr: 3.84e-03, grad_scale: 32.0 2023-06-25 08:28:50,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1295172.0, ans=0.125 2023-06-25 08:28:50,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1295172.0, ans=0.09899494936611666 2023-06-25 08:29:01,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1295232.0, ans=0.0 2023-06-25 08:29:27,572 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.24 vs. 
limit=15.0 2023-06-25 08:29:48,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1295352.0, ans=0.125 2023-06-25 08:30:27,393 INFO [train.py:996] (2/4) Epoch 8, batch 2450, loss[loss=0.2077, simple_loss=0.2721, pruned_loss=0.07162, over 21874.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2957, pruned_loss=0.07232, over 4268153.56 frames. ], batch size: 107, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:30:39,346 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.55 vs. limit=10.0 2023-06-25 08:31:13,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1295592.0, ans=0.125 2023-06-25 08:31:24,811 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.579e+02 3.841e+02 6.208e+02 9.164e+02 1.809e+03, threshold=1.242e+03, percent-clipped=16.0 2023-06-25 08:31:31,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1295652.0, ans=0.125 2023-06-25 08:31:50,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1295652.0, ans=0.125 2023-06-25 08:32:12,787 INFO [train.py:996] (2/4) Epoch 8, batch 2500, loss[loss=0.1925, simple_loss=0.2743, pruned_loss=0.05534, over 21726.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2946, pruned_loss=0.07107, over 4272161.44 frames. ], batch size: 124, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:32:29,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1295832.0, ans=0.0 2023-06-25 08:33:43,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1296012.0, ans=0.125 2023-06-25 08:33:52,066 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.27 vs. limit=8.0 2023-06-25 08:33:59,246 INFO [train.py:996] (2/4) Epoch 8, batch 2550, loss[loss=0.2255, simple_loss=0.2969, pruned_loss=0.07702, over 14936.00 frames. ], tot_loss[loss=0.217, simple_loss=0.292, pruned_loss=0.07097, over 4261694.72 frames. ], batch size: 61, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:34:01,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1296072.0, ans=0.125 2023-06-25 08:34:56,189 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.632e+02 3.347e+02 3.968e+02 6.148e+02 1.129e+03, threshold=7.936e+02, percent-clipped=0.0 2023-06-25 08:35:36,396 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:35:49,664 INFO [train.py:996] (2/4) Epoch 8, batch 2600, loss[loss=0.225, simple_loss=0.3013, pruned_loss=0.07434, over 21309.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.293, pruned_loss=0.07213, over 4263633.62 frames. 
], batch size: 143, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:35:57,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1296372.0, ans=0.0 2023-06-25 08:36:20,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1296432.0, ans=0.0 2023-06-25 08:36:31,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1296492.0, ans=0.0 2023-06-25 08:36:38,963 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=15.0 2023-06-25 08:37:09,336 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.04 vs. limit=15.0 2023-06-25 08:37:40,598 INFO [train.py:996] (2/4) Epoch 8, batch 2650, loss[loss=0.2356, simple_loss=0.3158, pruned_loss=0.07772, over 21807.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.295, pruned_loss=0.07414, over 4275901.63 frames. ], batch size: 282, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:38:29,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1296792.0, ans=0.125 2023-06-25 08:38:37,355 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 3.828e+02 4.857e+02 7.020e+02 1.360e+03, threshold=9.714e+02, percent-clipped=21.0 2023-06-25 08:38:45,736 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.39 vs. limit=15.0 2023-06-25 08:39:24,892 INFO [train.py:996] (2/4) Epoch 8, batch 2700, loss[loss=0.2246, simple_loss=0.2932, pruned_loss=0.07803, over 21794.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2944, pruned_loss=0.07302, over 4280397.38 frames. ], batch size: 124, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:40:05,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1297092.0, ans=0.125 2023-06-25 08:40:13,759 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.44 vs. limit=6.0 2023-06-25 08:40:18,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1297092.0, ans=0.125 2023-06-25 08:40:34,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1297152.0, ans=0.1 2023-06-25 08:40:35,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1297152.0, ans=0.025 2023-06-25 08:41:17,924 INFO [train.py:996] (2/4) Epoch 8, batch 2750, loss[loss=0.2478, simple_loss=0.3259, pruned_loss=0.08489, over 21354.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2944, pruned_loss=0.07207, over 4285023.33 frames. 
], batch size: 159, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:41:34,893 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:42:01,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1297392.0, ans=0.0 2023-06-25 08:42:03,451 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.56 vs. limit=15.0 2023-06-25 08:42:09,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1297392.0, ans=0.04949747468305833 2023-06-25 08:42:27,793 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.672e+02 4.055e+02 5.362e+02 7.595e+02 1.481e+03, threshold=1.072e+03, percent-clipped=12.0 2023-06-25 08:42:33,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1297452.0, ans=0.125 2023-06-25 08:43:11,598 INFO [train.py:996] (2/4) Epoch 8, batch 2800, loss[loss=0.2588, simple_loss=0.3216, pruned_loss=0.09796, over 21406.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2993, pruned_loss=0.07378, over 4284223.23 frames. ], batch size: 211, lr: 3.83e-03, grad_scale: 32.0 2023-06-25 08:43:39,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1297632.0, ans=0.0 2023-06-25 08:43:56,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1297692.0, ans=0.125 2023-06-25 08:44:05,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1297692.0, ans=0.125 2023-06-25 08:44:21,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1297692.0, ans=0.125 2023-06-25 08:44:24,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1297752.0, ans=0.125 2023-06-25 08:44:53,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1297812.0, ans=0.025 2023-06-25 08:44:59,936 INFO [train.py:996] (2/4) Epoch 8, batch 2850, loss[loss=0.226, simple_loss=0.2984, pruned_loss=0.07684, over 21330.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3018, pruned_loss=0.07543, over 4286194.47 frames. ], batch size: 176, lr: 3.83e-03, grad_scale: 32.0 2023-06-25 08:45:23,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1297932.0, ans=0.125 2023-06-25 08:45:36,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1297932.0, ans=0.125 2023-06-25 08:46:13,110 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.766e+02 3.662e+02 5.066e+02 7.139e+02 1.545e+03, threshold=1.013e+03, percent-clipped=5.0 2023-06-25 08:46:50,049 INFO [train.py:996] (2/4) Epoch 8, batch 2900, loss[loss=0.2106, simple_loss=0.2794, pruned_loss=0.07093, over 21806.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2974, pruned_loss=0.07463, over 4285286.96 frames. 
], batch size: 282, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 08:46:57,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1298172.0, ans=0.2 2023-06-25 08:47:11,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1298172.0, ans=0.0 2023-06-25 08:48:26,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1298412.0, ans=0.125 2023-06-25 08:48:39,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1298412.0, ans=0.125 2023-06-25 08:48:42,078 INFO [train.py:996] (2/4) Epoch 8, batch 2950, loss[loss=0.2241, simple_loss=0.332, pruned_loss=0.0581, over 20866.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3001, pruned_loss=0.07476, over 4287284.71 frames. ], batch size: 607, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 08:49:15,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1298532.0, ans=0.1 2023-06-25 08:49:20,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1298532.0, ans=0.04949747468305833 2023-06-25 08:49:28,442 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.18 vs. limit=10.0 2023-06-25 08:49:46,143 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0 2023-06-25 08:49:57,024 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.695e+02 3.497e+02 4.851e+02 7.009e+02 1.350e+03, threshold=9.702e+02, percent-clipped=11.0 2023-06-25 08:50:33,451 INFO [train.py:996] (2/4) Epoch 8, batch 3000, loss[loss=0.2356, simple_loss=0.3109, pruned_loss=0.08014, over 21796.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3041, pruned_loss=0.07574, over 4290677.02 frames. ], batch size: 247, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 08:50:33,451 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 08:50:54,967 INFO [train.py:1028] (2/4) Epoch 8, validation: loss=0.2557, simple_loss=0.3462, pruned_loss=0.08265, over 1796401.00 frames. 2023-06-25 08:50:54,968 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-25 08:51:30,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1298832.0, ans=0.125 2023-06-25 08:51:30,875 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.05 vs. limit=10.0 2023-06-25 08:52:18,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1298952.0, ans=0.2 2023-06-25 08:52:24,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1299012.0, ans=0.2 2023-06-25 08:52:45,452 INFO [train.py:996] (2/4) Epoch 8, batch 3050, loss[loss=0.1864, simple_loss=0.2819, pruned_loss=0.04545, over 21769.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3008, pruned_loss=0.07316, over 4286397.75 frames. 
], batch size: 351, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 08:53:14,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1299132.0, ans=0.0 2023-06-25 08:53:15,604 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=12.0 2023-06-25 08:53:55,755 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.393e+02 3.327e+02 3.997e+02 5.438e+02 1.383e+03, threshold=7.994e+02, percent-clipped=4.0 2023-06-25 08:54:26,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1299312.0, ans=0.125 2023-06-25 08:54:37,080 INFO [train.py:996] (2/4) Epoch 8, batch 3100, loss[loss=0.2087, simple_loss=0.3041, pruned_loss=0.05665, over 21687.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3017, pruned_loss=0.07318, over 4282172.32 frames. ], batch size: 298, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 08:54:40,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1299372.0, ans=0.04949747468305833 2023-06-25 08:55:14,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1299432.0, ans=0.0 2023-06-25 08:56:00,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1299552.0, ans=0.125 2023-06-25 08:56:06,607 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:56:08,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1299612.0, ans=0.0 2023-06-25 08:56:39,331 INFO [train.py:996] (2/4) Epoch 8, batch 3150, loss[loss=0.2318, simple_loss=0.3091, pruned_loss=0.07722, over 21569.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3031, pruned_loss=0.07358, over 4274779.79 frames. ], batch size: 230, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 08:56:39,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1299672.0, ans=10.0 2023-06-25 08:57:18,825 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.82 vs. limit=15.0 2023-06-25 08:57:44,758 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.490e+02 3.426e+02 4.350e+02 5.969e+02 1.538e+03, threshold=8.700e+02, percent-clipped=12.0 2023-06-25 08:57:54,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1299852.0, ans=0.0 2023-06-25 08:58:22,428 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=22.5 2023-06-25 08:58:36,651 INFO [train.py:996] (2/4) Epoch 8, batch 3200, loss[loss=0.2252, simple_loss=0.308, pruned_loss=0.07113, over 21778.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.304, pruned_loss=0.07331, over 4276597.29 frames. 
], batch size: 332, lr: 3.83e-03, grad_scale: 32.0 2023-06-25 09:00:23,472 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 09:00:28,334 INFO [train.py:996] (2/4) Epoch 8, batch 3250, loss[loss=0.2018, simple_loss=0.2715, pruned_loss=0.06603, over 21345.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3043, pruned_loss=0.07456, over 4265191.67 frames. ], batch size: 194, lr: 3.83e-03, grad_scale: 32.0 2023-06-25 09:00:34,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1300272.0, ans=0.125 2023-06-25 09:01:18,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1300392.0, ans=0.125 2023-06-25 09:01:23,874 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.24 vs. limit=22.5 2023-06-25 09:01:30,044 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.648e+02 3.942e+02 5.285e+02 9.066e+02 2.066e+03, threshold=1.057e+03, percent-clipped=29.0 2023-06-25 09:01:38,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1300452.0, ans=0.1 2023-06-25 09:01:54,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1300452.0, ans=0.125 2023-06-25 09:02:02,189 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=15.0 2023-06-25 09:02:20,264 INFO [train.py:996] (2/4) Epoch 8, batch 3300, loss[loss=0.2389, simple_loss=0.3237, pruned_loss=0.07702, over 21315.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3002, pruned_loss=0.07354, over 4267600.89 frames. ], batch size: 549, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:03:38,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1300752.0, ans=0.125 2023-06-25 09:03:39,019 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.50 vs. limit=15.0 2023-06-25 09:04:11,676 INFO [train.py:996] (2/4) Epoch 8, batch 3350, loss[loss=0.2652, simple_loss=0.3227, pruned_loss=0.1039, over 21594.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3019, pruned_loss=0.0738, over 4263823.48 frames. 
], batch size: 471, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:04:30,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1300932.0, ans=0.125 2023-06-25 09:04:50,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1300932.0, ans=0.0 2023-06-25 09:05:00,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1300992.0, ans=0.125 2023-06-25 09:05:23,415 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.477e+02 4.026e+02 5.637e+02 8.126e+02 1.843e+03, threshold=1.127e+03, percent-clipped=12.0 2023-06-25 09:05:23,980 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 09:05:55,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1301112.0, ans=0.2 2023-06-25 09:06:00,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1301172.0, ans=0.125 2023-06-25 09:06:01,861 INFO [train.py:996] (2/4) Epoch 8, batch 3400, loss[loss=0.2218, simple_loss=0.3345, pruned_loss=0.05454, over 21225.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3022, pruned_loss=0.07382, over 4275827.50 frames. ], batch size: 549, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:06:08,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1301172.0, ans=0.1 2023-06-25 09:07:09,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1301292.0, ans=0.0 2023-06-25 09:07:18,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1301352.0, ans=0.125 2023-06-25 09:07:42,902 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-06-25 09:07:55,875 INFO [train.py:996] (2/4) Epoch 8, batch 3450, loss[loss=0.1899, simple_loss=0.2577, pruned_loss=0.06104, over 21517.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2992, pruned_loss=0.07474, over 4274854.26 frames. ], batch size: 132, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:08:14,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1301472.0, ans=0.0 2023-06-25 09:09:13,943 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.637e+02 3.588e+02 4.974e+02 7.725e+02 1.763e+03, threshold=9.948e+02, percent-clipped=11.0 2023-06-25 09:09:14,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1301652.0, ans=0.125 2023-06-25 09:09:29,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1301712.0, ans=0.0 2023-06-25 09:09:53,632 INFO [train.py:996] (2/4) Epoch 8, batch 3500, loss[loss=0.2515, simple_loss=0.3242, pruned_loss=0.08937, over 21657.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3061, pruned_loss=0.07718, over 4276740.73 frames. 
], batch size: 263, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:10:28,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1301832.0, ans=0.04949747468305833 2023-06-25 09:10:43,720 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=15.0 2023-06-25 09:10:46,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1301892.0, ans=0.125 2023-06-25 09:11:43,854 INFO [train.py:996] (2/4) Epoch 8, batch 3550, loss[loss=0.1962, simple_loss=0.2673, pruned_loss=0.0626, over 21625.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3086, pruned_loss=0.07829, over 4274333.08 frames. ], batch size: 298, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:11:55,751 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.16 vs. limit=15.0 2023-06-25 09:12:52,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1302252.0, ans=0.1 2023-06-25 09:12:55,362 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.560e+02 4.003e+02 5.383e+02 7.230e+02 1.174e+03, threshold=1.077e+03, percent-clipped=7.0 2023-06-25 09:13:02,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1302252.0, ans=0.125 2023-06-25 09:13:22,943 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.83 vs. limit=15.0 2023-06-25 09:13:35,430 INFO [train.py:996] (2/4) Epoch 8, batch 3600, loss[loss=0.2427, simple_loss=0.3136, pruned_loss=0.08591, over 21721.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3037, pruned_loss=0.07772, over 4267001.28 frames. ], batch size: 351, lr: 3.83e-03, grad_scale: 32.0 2023-06-25 09:14:02,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1302432.0, ans=0.125 2023-06-25 09:14:40,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1302552.0, ans=0.125 2023-06-25 09:14:47,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1302552.0, ans=0.0 2023-06-25 09:15:18,566 INFO [train.py:996] (2/4) Epoch 8, batch 3650, loss[loss=0.2214, simple_loss=0.3097, pruned_loss=0.06654, over 21668.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3055, pruned_loss=0.07871, over 4273769.57 frames. ], batch size: 389, lr: 3.83e-03, grad_scale: 32.0 2023-06-25 09:16:26,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1302792.0, ans=0.0 2023-06-25 09:16:31,575 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.652e+02 4.088e+02 5.545e+02 7.819e+02 1.547e+03, threshold=1.109e+03, percent-clipped=4.0 2023-06-25 09:16:39,312 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.82 vs. 
limit=10.0 2023-06-25 09:16:44,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1302852.0, ans=0.035 2023-06-25 09:17:09,596 INFO [train.py:996] (2/4) Epoch 8, batch 3700, loss[loss=0.1845, simple_loss=0.2493, pruned_loss=0.05981, over 21435.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3021, pruned_loss=0.07769, over 4273891.93 frames. ], batch size: 212, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:17:13,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1302972.0, ans=0.125 2023-06-25 09:17:17,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1302972.0, ans=0.125 2023-06-25 09:18:14,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1303092.0, ans=0.125 2023-06-25 09:19:01,410 INFO [train.py:996] (2/4) Epoch 8, batch 3750, loss[loss=0.1825, simple_loss=0.2646, pruned_loss=0.05024, over 21741.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3004, pruned_loss=0.07643, over 4275532.39 frames. ], batch size: 282, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:19:53,290 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.16 vs. limit=6.0 2023-06-25 09:20:21,196 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.521e+02 3.272e+02 4.501e+02 6.560e+02 9.292e+02, threshold=9.001e+02, percent-clipped=0.0 2023-06-25 09:20:47,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1303512.0, ans=0.125 2023-06-25 09:20:58,391 INFO [train.py:996] (2/4) Epoch 8, batch 3800, loss[loss=0.2932, simple_loss=0.3457, pruned_loss=0.1204, over 21408.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2991, pruned_loss=0.07567, over 4277893.82 frames. ], batch size: 509, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:20:58,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1303572.0, ans=0.0 2023-06-25 09:22:13,294 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-25 09:22:15,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1303752.0, ans=0.125 2023-06-25 09:22:30,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1303812.0, ans=0.125 2023-06-25 09:22:35,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1303812.0, ans=0.2 2023-06-25 09:22:39,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1303872.0, ans=0.0 2023-06-25 09:22:40,703 INFO [train.py:996] (2/4) Epoch 8, batch 3850, loss[loss=0.2087, simple_loss=0.2742, pruned_loss=0.07164, over 22003.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2966, pruned_loss=0.07541, over 4272644.60 frames. 
], batch size: 103, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:23:41,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1303992.0, ans=0.1 2023-06-25 09:23:58,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1304052.0, ans=0.125 2023-06-25 09:23:59,528 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.585e+02 3.373e+02 4.487e+02 6.167e+02 2.000e+03, threshold=8.974e+02, percent-clipped=6.0 2023-06-25 09:24:00,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1304052.0, ans=10.0 2023-06-25 09:24:31,275 INFO [train.py:996] (2/4) Epoch 8, batch 3900, loss[loss=0.202, simple_loss=0.272, pruned_loss=0.06602, over 21758.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2944, pruned_loss=0.07501, over 4270161.49 frames. ], batch size: 112, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:24:32,601 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.26 vs. limit=15.0 2023-06-25 09:25:18,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1304232.0, ans=0.125 2023-06-25 09:25:45,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1304352.0, ans=0.125 2023-06-25 09:26:10,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1304412.0, ans=0.0 2023-06-25 09:26:20,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1304472.0, ans=0.125 2023-06-25 09:26:27,155 INFO [train.py:996] (2/4) Epoch 8, batch 3950, loss[loss=0.1824, simple_loss=0.2717, pruned_loss=0.04653, over 21747.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2957, pruned_loss=0.07446, over 4276742.38 frames. ], batch size: 332, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:27:10,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1304532.0, ans=0.125 2023-06-25 09:27:38,822 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.574e+02 3.686e+02 5.186e+02 7.402e+02 1.424e+03, threshold=1.037e+03, percent-clipped=9.0 2023-06-25 09:27:51,017 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.03 vs. limit=15.0 2023-06-25 09:28:16,206 INFO [train.py:996] (2/4) Epoch 8, batch 4000, loss[loss=0.204, simple_loss=0.2618, pruned_loss=0.07309, over 20198.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2888, pruned_loss=0.07067, over 4267050.84 frames. ], batch size: 703, lr: 3.82e-03, grad_scale: 32.0 2023-06-25 09:28:26,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1304772.0, ans=0.125 2023-06-25 09:30:11,499 INFO [train.py:996] (2/4) Epoch 8, batch 4050, loss[loss=0.1974, simple_loss=0.2895, pruned_loss=0.05266, over 21752.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2881, pruned_loss=0.069, over 4264727.83 frames. 
], batch size: 351, lr: 3.82e-03, grad_scale: 32.0 2023-06-25 09:30:12,893 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=15.0 2023-06-25 09:30:15,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=1305072.0, ans=0.02 2023-06-25 09:31:18,967 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.508e+02 3.803e+02 4.888e+02 6.657e+02 1.371e+03, threshold=9.776e+02, percent-clipped=4.0 2023-06-25 09:31:37,150 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 09:31:59,982 INFO [train.py:996] (2/4) Epoch 8, batch 4100, loss[loss=0.2077, simple_loss=0.2973, pruned_loss=0.05908, over 21774.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2892, pruned_loss=0.06949, over 4267725.16 frames. ], batch size: 298, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:32:02,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1305372.0, ans=0.125 2023-06-25 09:32:21,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1305432.0, ans=0.125 2023-06-25 09:32:35,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1305432.0, ans=0.125 2023-06-25 09:33:48,834 INFO [train.py:996] (2/4) Epoch 8, batch 4150, loss[loss=0.1807, simple_loss=0.2658, pruned_loss=0.04781, over 21697.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2913, pruned_loss=0.06725, over 4271174.90 frames. ], batch size: 282, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:33:54,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1305672.0, ans=0.125 2023-06-25 09:34:07,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1305672.0, ans=0.1 2023-06-25 09:34:18,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1305732.0, ans=0.125 2023-06-25 09:35:00,802 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.587e+02 3.172e+02 3.844e+02 5.295e+02 7.953e+02, threshold=7.689e+02, percent-clipped=0.0 2023-06-25 09:35:41,080 INFO [train.py:996] (2/4) Epoch 8, batch 4200, loss[loss=0.1947, simple_loss=0.2685, pruned_loss=0.06045, over 21104.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2922, pruned_loss=0.06716, over 4270786.97 frames. ], batch size: 143, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:36:00,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1305972.0, ans=0.0 2023-06-25 09:36:19,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1306032.0, ans=0.125 2023-06-25 09:36:27,684 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=15.0 2023-06-25 09:37:33,637 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.46 vs. 
limit=15.0 2023-06-25 09:37:38,053 INFO [train.py:996] (2/4) Epoch 8, batch 4250, loss[loss=0.2668, simple_loss=0.369, pruned_loss=0.0823, over 21274.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2994, pruned_loss=0.0687, over 4268998.35 frames. ], batch size: 549, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:38:36,067 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-25 09:38:53,446 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.52 vs. limit=15.0 2023-06-25 09:38:57,635 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.607e+02 4.053e+02 6.185e+02 8.917e+02 1.733e+03, threshold=1.237e+03, percent-clipped=33.0 2023-06-25 09:39:03,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1306452.0, ans=0.07 2023-06-25 09:39:21,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1306512.0, ans=0.1 2023-06-25 09:39:38,314 INFO [train.py:996] (2/4) Epoch 8, batch 4300, loss[loss=0.2013, simple_loss=0.3049, pruned_loss=0.0488, over 21851.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3068, pruned_loss=0.07131, over 4275515.78 frames. ], batch size: 316, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:39:39,562 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.99 vs. limit=22.5 2023-06-25 09:41:28,181 INFO [train.py:996] (2/4) Epoch 8, batch 4350, loss[loss=0.2022, simple_loss=0.2843, pruned_loss=0.06005, over 21184.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3039, pruned_loss=0.07016, over 4265026.65 frames. ], batch size: 548, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:42:44,513 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.350e+02 3.580e+02 4.513e+02 6.539e+02 1.169e+03, threshold=9.025e+02, percent-clipped=0.0 2023-06-25 09:42:59,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1307112.0, ans=0.0 2023-06-25 09:43:03,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1307112.0, ans=0.125 2023-06-25 09:43:17,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1307172.0, ans=0.1 2023-06-25 09:43:19,230 INFO [train.py:996] (2/4) Epoch 8, batch 4400, loss[loss=0.1831, simple_loss=0.2713, pruned_loss=0.04747, over 21201.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2988, pruned_loss=0.06988, over 4266413.55 frames. ], batch size: 159, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:43:41,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1307232.0, ans=0.125 2023-06-25 09:43:48,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1307232.0, ans=0.125 2023-06-25 09:44:57,978 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.26 vs. 
limit=15.0 2023-06-25 09:45:16,019 INFO [train.py:996] (2/4) Epoch 8, batch 4450, loss[loss=0.2955, simple_loss=0.4113, pruned_loss=0.08989, over 21222.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3081, pruned_loss=0.07152, over 4273493.15 frames. ], batch size: 548, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:45:52,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1307532.0, ans=0.125 2023-06-25 09:46:32,171 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.688e+02 3.788e+02 5.957e+02 8.951e+02 1.705e+03, threshold=1.191e+03, percent-clipped=23.0 2023-06-25 09:47:06,069 INFO [train.py:996] (2/4) Epoch 8, batch 4500, loss[loss=0.2726, simple_loss=0.3432, pruned_loss=0.101, over 21629.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3086, pruned_loss=0.07374, over 4280142.10 frames. ], batch size: 471, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:47:25,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1307772.0, ans=0.125 2023-06-25 09:47:45,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1307832.0, ans=0.0 2023-06-25 09:48:12,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1307892.0, ans=0.125 2023-06-25 09:48:46,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1308012.0, ans=0.125 2023-06-25 09:48:51,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1308012.0, ans=0.125 2023-06-25 09:48:56,031 INFO [train.py:996] (2/4) Epoch 8, batch 4550, loss[loss=0.2661, simple_loss=0.3397, pruned_loss=0.0963, over 21903.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3103, pruned_loss=0.07433, over 4277125.62 frames. ], batch size: 372, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:49:15,773 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.76 vs. limit=6.0 2023-06-25 09:49:17,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1308072.0, ans=0.125 2023-06-25 09:49:56,153 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.42 vs. limit=15.0 2023-06-25 09:50:07,398 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.58 vs. limit=15.0 2023-06-25 09:50:18,041 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.595e+02 3.343e+02 4.134e+02 5.307e+02 1.038e+03, threshold=8.269e+02, percent-clipped=0.0 2023-06-25 09:50:18,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1308252.0, ans=0.95 2023-06-25 09:50:52,059 INFO [train.py:996] (2/4) Epoch 8, batch 4600, loss[loss=0.1991, simple_loss=0.2862, pruned_loss=0.05597, over 21839.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3124, pruned_loss=0.07534, over 4278387.48 frames. 
], batch size: 332, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:50:59,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1308372.0, ans=0.0 2023-06-25 09:51:12,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1308372.0, ans=0.0 2023-06-25 09:52:07,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1308552.0, ans=0.1 2023-06-25 09:52:42,597 INFO [train.py:996] (2/4) Epoch 8, batch 4650, loss[loss=0.1647, simple_loss=0.2403, pruned_loss=0.04452, over 21314.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3068, pruned_loss=0.07362, over 4286866.99 frames. ], batch size: 176, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:52:44,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1308672.0, ans=0.1 2023-06-25 09:53:59,077 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.293e+02 3.213e+02 3.806e+02 5.357e+02 1.908e+03, threshold=7.612e+02, percent-clipped=10.0 2023-06-25 09:54:19,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1308912.0, ans=0.04949747468305833 2023-06-25 09:54:28,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1308912.0, ans=0.0 2023-06-25 09:54:31,184 INFO [train.py:996] (2/4) Epoch 8, batch 4700, loss[loss=0.1957, simple_loss=0.2614, pruned_loss=0.06494, over 21522.00 frames. ], tot_loss[loss=0.219, simple_loss=0.296, pruned_loss=0.07096, over 4291908.47 frames. ], batch size: 391, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:54:43,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1308972.0, ans=0.125 2023-06-25 09:55:03,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1309032.0, ans=0.1 2023-06-25 09:55:12,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1309032.0, ans=0.0 2023-06-25 09:56:14,802 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 09:56:21,259 INFO [train.py:996] (2/4) Epoch 8, batch 4750, loss[loss=0.2513, simple_loss=0.3096, pruned_loss=0.09647, over 21865.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.291, pruned_loss=0.0711, over 4293032.03 frames. ], batch size: 415, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:56:22,713 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.48 vs. limit=15.0 2023-06-25 09:56:50,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1309332.0, ans=0.1 2023-06-25 09:56:52,848 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.34 vs. 
limit=22.5 2023-06-25 09:57:39,342 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.746e+02 3.551e+02 4.538e+02 6.106e+02 1.235e+03, threshold=9.075e+02, percent-clipped=15.0 2023-06-25 09:57:46,071 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.76 vs. limit=15.0 2023-06-25 09:57:52,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1309512.0, ans=0.09899494936611666 2023-06-25 09:58:03,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1309512.0, ans=0.2 2023-06-25 09:58:17,097 INFO [train.py:996] (2/4) Epoch 8, batch 4800, loss[loss=0.223, simple_loss=0.2923, pruned_loss=0.07686, over 21297.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2921, pruned_loss=0.07207, over 4291210.39 frames. ], batch size: 176, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:59:09,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1309692.0, ans=0.0 2023-06-25 09:59:36,104 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=15.0 2023-06-25 09:59:46,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1309812.0, ans=0.125 2023-06-25 09:59:59,478 INFO [train.py:996] (2/4) Epoch 8, batch 4850, loss[loss=0.2037, simple_loss=0.2742, pruned_loss=0.06661, over 15586.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2918, pruned_loss=0.07095, over 4279422.86 frames. ], batch size: 60, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 10:00:21,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1309872.0, ans=0.125 2023-06-25 10:00:42,587 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.05 vs. limit=6.0 2023-06-25 10:01:16,361 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.775e+02 3.669e+02 4.660e+02 6.748e+02 1.065e+03, threshold=9.320e+02, percent-clipped=5.0 2023-06-25 10:01:44,401 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=15.0 2023-06-25 10:01:48,392 INFO [train.py:996] (2/4) Epoch 8, batch 4900, loss[loss=0.2309, simple_loss=0.3012, pruned_loss=0.0803, over 16541.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2928, pruned_loss=0.07214, over 4282399.49 frames. ], batch size: 63, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 10:02:22,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1310232.0, ans=0.125 2023-06-25 10:02:37,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1310292.0, ans=0.04949747468305833 2023-06-25 10:03:47,922 INFO [train.py:996] (2/4) Epoch 8, batch 4950, loss[loss=0.2214, simple_loss=0.3221, pruned_loss=0.06035, over 21158.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2962, pruned_loss=0.07068, over 4275781.13 frames. 
], batch size: 548, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 10:03:54,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1310472.0, ans=0.07 2023-06-25 10:04:20,952 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=15.0 2023-06-25 10:05:00,807 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.225e+02 3.071e+02 4.183e+02 5.786e+02 1.763e+03, threshold=8.366e+02, percent-clipped=8.0 2023-06-25 10:05:03,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1310652.0, ans=0.0 2023-06-25 10:05:11,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1310712.0, ans=0.1 2023-06-25 10:05:37,416 INFO [train.py:996] (2/4) Epoch 8, batch 5000, loss[loss=0.2245, simple_loss=0.3474, pruned_loss=0.05078, over 20750.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2964, pruned_loss=0.06753, over 4276048.22 frames. ], batch size: 607, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 10:05:43,504 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-25 10:06:14,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1310892.0, ans=0.125 2023-06-25 10:06:27,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1310892.0, ans=0.2 2023-06-25 10:06:29,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1310892.0, ans=0.125 2023-06-25 10:06:29,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1310892.0, ans=0.04949747468305833 2023-06-25 10:06:30,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1310892.0, ans=0.95 2023-06-25 10:06:39,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1310952.0, ans=0.0 2023-06-25 10:07:08,030 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-25 10:07:19,125 INFO [train.py:996] (2/4) Epoch 8, batch 5050, loss[loss=0.2186, simple_loss=0.2889, pruned_loss=0.07416, over 21945.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2969, pruned_loss=0.0687, over 4281863.52 frames. 
], batch size: 316, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 10:07:49,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1311132.0, ans=0.2 2023-06-25 10:08:30,091 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.462e+02 3.598e+02 4.329e+02 6.155e+02 1.761e+03, threshold=8.658e+02, percent-clipped=10.0 2023-06-25 10:08:43,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1311312.0, ans=0.125 2023-06-25 10:08:46,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1311312.0, ans=0.1 2023-06-25 10:09:07,169 INFO [train.py:996] (2/4) Epoch 8, batch 5100, loss[loss=0.2379, simple_loss=0.3096, pruned_loss=0.0831, over 21402.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2964, pruned_loss=0.07025, over 4291246.89 frames. ], batch size: 176, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:09:25,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1311372.0, ans=0.125 2023-06-25 10:10:21,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1311552.0, ans=0.0 2023-06-25 10:10:40,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1311612.0, ans=0.0 2023-06-25 10:10:52,969 INFO [train.py:996] (2/4) Epoch 8, batch 5150, loss[loss=0.2236, simple_loss=0.2975, pruned_loss=0.07484, over 21348.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.295, pruned_loss=0.07094, over 4292333.07 frames. ], batch size: 159, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:11:59,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1311792.0, ans=0.125 2023-06-25 10:12:11,330 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.655e+02 3.617e+02 5.481e+02 7.313e+02 1.650e+03, threshold=1.096e+03, percent-clipped=16.0 2023-06-25 10:12:48,518 INFO [train.py:996] (2/4) Epoch 8, batch 5200, loss[loss=0.2471, simple_loss=0.3468, pruned_loss=0.0737, over 21863.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2977, pruned_loss=0.07185, over 4291481.49 frames. 
], batch size: 371, lr: 3.81e-03, grad_scale: 32.0 2023-06-25 10:12:56,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1311972.0, ans=0.125 2023-06-25 10:13:19,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1312032.0, ans=0.125 2023-06-25 10:13:21,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1312032.0, ans=0.2 2023-06-25 10:14:13,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1312152.0, ans=0.0 2023-06-25 10:14:20,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1312212.0, ans=0.125 2023-06-25 10:14:30,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1312212.0, ans=0.2 2023-06-25 10:14:32,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1312212.0, ans=0.0 2023-06-25 10:14:43,379 INFO [train.py:996] (2/4) Epoch 8, batch 5250, loss[loss=0.1777, simple_loss=0.2643, pruned_loss=0.04555, over 21321.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3009, pruned_loss=0.07045, over 4282978.91 frames. ], batch size: 131, lr: 3.81e-03, grad_scale: 32.0 2023-06-25 10:15:10,679 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.57 vs. limit=22.5 2023-06-25 10:15:25,938 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.07 vs. limit=15.0 2023-06-25 10:15:53,839 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.659e+02 3.587e+02 4.772e+02 6.547e+02 1.598e+03, threshold=9.543e+02, percent-clipped=4.0 2023-06-25 10:16:28,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1312572.0, ans=0.125 2023-06-25 10:16:29,970 INFO [train.py:996] (2/4) Epoch 8, batch 5300, loss[loss=0.2521, simple_loss=0.3132, pruned_loss=0.09546, over 21812.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3006, pruned_loss=0.07197, over 4285964.04 frames. ], batch size: 441, lr: 3.81e-03, grad_scale: 32.0 2023-06-25 10:16:30,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1312572.0, ans=0.2 2023-06-25 10:17:57,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1312812.0, ans=0.125 2023-06-25 10:18:17,017 INFO [train.py:996] (2/4) Epoch 8, batch 5350, loss[loss=0.2281, simple_loss=0.2851, pruned_loss=0.08555, over 21587.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2992, pruned_loss=0.07295, over 4292663.59 frames. 
], batch size: 548, lr: 3.81e-03, grad_scale: 32.0 2023-06-25 10:18:29,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1312872.0, ans=0.125 2023-06-25 10:18:33,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1312932.0, ans=0.0 2023-06-25 10:19:28,374 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.684e+02 3.542e+02 4.424e+02 5.994e+02 1.106e+03, threshold=8.848e+02, percent-clipped=4.0 2023-06-25 10:20:05,569 INFO [train.py:996] (2/4) Epoch 8, batch 5400, loss[loss=0.1937, simple_loss=0.2702, pruned_loss=0.05859, over 21715.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2971, pruned_loss=0.07336, over 4294683.36 frames. ], batch size: 247, lr: 3.81e-03, grad_scale: 32.0 2023-06-25 10:20:13,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1313172.0, ans=0.125 2023-06-25 10:20:36,462 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=18.66 vs. limit=15.0 2023-06-25 10:20:46,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1313292.0, ans=0.0 2023-06-25 10:20:51,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1313292.0, ans=0.1 2023-06-25 10:20:53,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1313292.0, ans=0.125 2023-06-25 10:21:41,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1313412.0, ans=0.125 2023-06-25 10:21:50,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1313412.0, ans=0.1 2023-06-25 10:21:55,123 INFO [train.py:996] (2/4) Epoch 8, batch 5450, loss[loss=0.2234, simple_loss=0.3074, pruned_loss=0.06969, over 21555.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2987, pruned_loss=0.07159, over 4291679.91 frames. ], batch size: 194, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:21:59,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1313472.0, ans=0.0 2023-06-25 10:22:21,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1313532.0, ans=0.2 2023-06-25 10:23:15,417 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.599e+02 4.381e+02 6.345e+02 1.127e+03 2.400e+03, threshold=1.269e+03, percent-clipped=34.0 2023-06-25 10:23:43,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1313712.0, ans=0.1 2023-06-25 10:23:45,599 INFO [train.py:996] (2/4) Epoch 8, batch 5500, loss[loss=0.1939, simple_loss=0.2947, pruned_loss=0.0466, over 21712.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.3029, pruned_loss=0.06826, over 4287462.88 frames. ], batch size: 351, lr: 3.81e-03, grad_scale: 8.0 2023-06-25 10:23:48,690 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. 
limit=6.0 2023-06-25 10:24:40,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1313892.0, ans=0.1 2023-06-25 10:25:03,100 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.47 vs. limit=15.0 2023-06-25 10:25:20,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1314012.0, ans=0.125 2023-06-25 10:25:35,624 INFO [train.py:996] (2/4) Epoch 8, batch 5550, loss[loss=0.2042, simple_loss=0.285, pruned_loss=0.06169, over 20845.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.302, pruned_loss=0.06573, over 4281590.86 frames. ], batch size: 607, lr: 3.81e-03, grad_scale: 8.0 2023-06-25 10:26:00,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1314132.0, ans=0.125 2023-06-25 10:26:41,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1314192.0, ans=0.0 2023-06-25 10:26:53,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1314252.0, ans=0.125 2023-06-25 10:26:53,404 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 10:27:03,715 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 3.121e+02 4.354e+02 6.729e+02 1.471e+03, threshold=8.708e+02, percent-clipped=1.0 2023-06-25 10:27:26,992 INFO [train.py:996] (2/4) Epoch 8, batch 5600, loss[loss=0.2841, simple_loss=0.3806, pruned_loss=0.09383, over 21603.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.3011, pruned_loss=0.06374, over 4277330.66 frames. ], batch size: 441, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:27:29,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1314372.0, ans=0.025 2023-06-25 10:28:17,716 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 10:29:02,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1314612.0, ans=0.125 2023-06-25 10:29:15,434 INFO [train.py:996] (2/4) Epoch 8, batch 5650, loss[loss=0.1854, simple_loss=0.2991, pruned_loss=0.03588, over 20720.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.3043, pruned_loss=0.06534, over 4268376.57 frames. ], batch size: 608, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:29:21,890 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.09 vs. 
limit=15.0 2023-06-25 10:29:32,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1314672.0, ans=0.125 2023-06-25 10:30:42,463 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.720e+02 4.225e+02 5.470e+02 8.803e+02 1.575e+03, threshold=1.094e+03, percent-clipped=25.0 2023-06-25 10:30:52,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1314912.0, ans=0.125 2023-06-25 10:31:12,026 INFO [train.py:996] (2/4) Epoch 8, batch 5700, loss[loss=0.2282, simple_loss=0.3243, pruned_loss=0.06605, over 21673.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.3035, pruned_loss=0.06742, over 4279301.96 frames. ], batch size: 414, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:32:02,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1315092.0, ans=0.1 2023-06-25 10:32:04,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1315092.0, ans=0.0 2023-06-25 10:32:46,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1315212.0, ans=0.025 2023-06-25 10:32:48,832 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-06-25 10:33:10,930 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.08 vs. limit=15.0 2023-06-25 10:33:14,791 INFO [train.py:996] (2/4) Epoch 8, batch 5750, loss[loss=0.1777, simple_loss=0.2615, pruned_loss=0.0469, over 21289.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.3002, pruned_loss=0.06484, over 4280758.73 frames. ], batch size: 176, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:34:03,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1315392.0, ans=0.0 2023-06-25 10:34:05,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1315392.0, ans=0.125 2023-06-25 10:34:26,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1315452.0, ans=0.0 2023-06-25 10:34:31,287 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.956e+02 5.585e+02 8.690e+02 2.193e+03, threshold=1.117e+03, percent-clipped=12.0 2023-06-25 10:35:05,106 INFO [train.py:996] (2/4) Epoch 8, batch 5800, loss[loss=0.2026, simple_loss=0.2971, pruned_loss=0.05408, over 21676.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.3001, pruned_loss=0.06366, over 4275464.76 frames. ], batch size: 247, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:35:21,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1315632.0, ans=0.125 2023-06-25 10:36:08,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1315752.0, ans=0.1 2023-06-25 10:36:55,287 INFO [train.py:996] (2/4) Epoch 8, batch 5850, loss[loss=0.1739, simple_loss=0.2593, pruned_loss=0.04425, over 21210.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2985, pruned_loss=0.06019, over 4275309.71 frames. 
], batch size: 159, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:36:56,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1315872.0, ans=0.125 2023-06-25 10:37:51,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1315992.0, ans=0.125 2023-06-25 10:37:52,299 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.48 vs. limit=15.0 2023-06-25 10:37:57,162 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.51 vs. limit=15.0 2023-06-25 10:38:15,570 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=15.0 2023-06-25 10:38:21,043 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 3.016e+02 4.169e+02 5.558e+02 1.178e+03, threshold=8.338e+02, percent-clipped=1.0 2023-06-25 10:38:26,060 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=22.5 2023-06-25 10:38:30,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1316112.0, ans=0.0 2023-06-25 10:38:43,382 INFO [train.py:996] (2/4) Epoch 8, batch 5900, loss[loss=0.2182, simple_loss=0.2905, pruned_loss=0.07294, over 21278.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2924, pruned_loss=0.05628, over 4265792.02 frames. ], batch size: 143, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:39:43,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1316292.0, ans=15.0 2023-06-25 10:40:36,586 INFO [train.py:996] (2/4) Epoch 8, batch 5950, loss[loss=0.1963, simple_loss=0.2619, pruned_loss=0.06535, over 21550.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2903, pruned_loss=0.05952, over 4271603.55 frames. ], batch size: 230, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:41:16,287 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.93 vs. limit=15.0 2023-06-25 10:41:19,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1316592.0, ans=0.0 2023-06-25 10:41:43,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1316652.0, ans=0.035 2023-06-25 10:41:46,206 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-25 10:41:57,154 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.419e+02 3.705e+02 4.644e+02 6.015e+02 1.261e+03, threshold=9.288e+02, percent-clipped=6.0 2023-06-25 10:42:06,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1316712.0, ans=0.125 2023-06-25 10:42:24,685 INFO [train.py:996] (2/4) Epoch 8, batch 6000, loss[loss=0.226, simple_loss=0.2971, pruned_loss=0.07746, over 15323.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2854, pruned_loss=0.06229, over 4264215.63 frames. 
], batch size: 60, lr: 3.81e-03, grad_scale: 32.0 2023-06-25 10:42:24,685 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 10:42:43,107 INFO [train.py:1028] (2/4) Epoch 8, validation: loss=0.2599, simple_loss=0.3542, pruned_loss=0.08283, over 1796401.00 frames. 2023-06-25 10:42:43,108 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-25 10:43:07,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1316832.0, ans=0.125 2023-06-25 10:44:20,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1317012.0, ans=0.0 2023-06-25 10:44:32,110 INFO [train.py:996] (2/4) Epoch 8, batch 6050, loss[loss=0.2085, simple_loss=0.2732, pruned_loss=0.07194, over 21424.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2799, pruned_loss=0.06314, over 4261352.17 frames. ], batch size: 509, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:45:35,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1317192.0, ans=0.1 2023-06-25 10:45:35,534 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=15.0 2023-06-25 10:45:40,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1317252.0, ans=0.125 2023-06-25 10:46:02,680 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.258e+02 3.027e+02 3.543e+02 4.966e+02 9.624e+02, threshold=7.086e+02, percent-clipped=3.0 2023-06-25 10:46:09,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1317312.0, ans=0.04949747468305833 2023-06-25 10:46:13,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1317312.0, ans=0.125 2023-06-25 10:46:21,252 INFO [train.py:996] (2/4) Epoch 8, batch 6100, loss[loss=0.2004, simple_loss=0.3018, pruned_loss=0.0495, over 19849.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2802, pruned_loss=0.06208, over 4270141.37 frames. ], batch size: 702, lr: 3.81e-03, grad_scale: 8.0 2023-06-25 10:46:21,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1317372.0, ans=0.125 2023-06-25 10:47:06,698 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.10 vs. limit=15.0 2023-06-25 10:47:24,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1317492.0, ans=0.1 2023-06-25 10:47:24,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1317492.0, ans=0.125 2023-06-25 10:47:40,651 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=15.0 2023-06-25 10:48:09,200 INFO [train.py:996] (2/4) Epoch 8, batch 6150, loss[loss=0.2132, simple_loss=0.283, pruned_loss=0.07171, over 21839.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.283, pruned_loss=0.06441, over 4283448.90 frames. 
], batch size: 351, lr: 3.81e-03, grad_scale: 8.0 2023-06-25 10:48:17,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1317672.0, ans=0.125 2023-06-25 10:48:20,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1317672.0, ans=0.125 2023-06-25 10:48:33,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1317732.0, ans=0.125 2023-06-25 10:48:36,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1317732.0, ans=0.2 2023-06-25 10:48:40,220 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=12.0 2023-06-25 10:49:00,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1317792.0, ans=0.125 2023-06-25 10:49:38,029 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.673e+02 3.233e+02 3.904e+02 5.485e+02 1.131e+03, threshold=7.808e+02, percent-clipped=12.0 2023-06-25 10:49:58,271 INFO [train.py:996] (2/4) Epoch 8, batch 6200, loss[loss=0.2216, simple_loss=0.2978, pruned_loss=0.07273, over 21525.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2862, pruned_loss=0.06486, over 4286989.12 frames. ], batch size: 131, lr: 3.81e-03, grad_scale: 8.0 2023-06-25 10:50:52,534 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 10:51:02,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1318092.0, ans=0.1 2023-06-25 10:51:28,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1318152.0, ans=0.2 2023-06-25 10:51:46,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1318212.0, ans=0.125 2023-06-25 10:51:49,431 INFO [train.py:996] (2/4) Epoch 8, batch 6250, loss[loss=0.2151, simple_loss=0.3199, pruned_loss=0.05513, over 21694.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2921, pruned_loss=0.06472, over 4274189.67 frames. ], batch size: 298, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 10:52:55,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1318392.0, ans=0.125 2023-06-25 10:53:17,968 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.870e+02 4.537e+02 6.426e+02 9.551e+02 1.693e+03, threshold=1.285e+03, percent-clipped=41.0 2023-06-25 10:53:25,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1318512.0, ans=0.0 2023-06-25 10:53:25,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1318512.0, ans=0.125 2023-06-25 10:53:42,670 INFO [train.py:996] (2/4) Epoch 8, batch 6300, loss[loss=0.2198, simple_loss=0.2938, pruned_loss=0.07292, over 21468.00 frames. ], tot_loss[loss=0.213, simple_loss=0.297, pruned_loss=0.06454, over 4276496.89 frames. 
], batch size: 131, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 10:54:40,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1318692.0, ans=0.04949747468305833 2023-06-25 10:54:44,678 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.19 vs. limit=10.0 2023-06-25 10:54:45,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1318692.0, ans=0.125 2023-06-25 10:55:42,029 INFO [train.py:996] (2/4) Epoch 8, batch 6350, loss[loss=0.2676, simple_loss=0.3435, pruned_loss=0.09588, over 21353.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2984, pruned_loss=0.06858, over 4278927.12 frames. ], batch size: 159, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 10:55:43,413 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-25 10:56:02,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1318932.0, ans=0.1 2023-06-25 10:56:24,275 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.13 vs. limit=15.0 2023-06-25 10:56:29,395 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=22.5 2023-06-25 10:57:02,239 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.857e+02 3.830e+02 4.751e+02 5.817e+02 1.226e+03, threshold=9.501e+02, percent-clipped=0.0 2023-06-25 10:57:24,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1319112.0, ans=0.125 2023-06-25 10:57:27,553 INFO [train.py:996] (2/4) Epoch 8, batch 6400, loss[loss=0.2342, simple_loss=0.3023, pruned_loss=0.08308, over 21832.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3034, pruned_loss=0.07261, over 4277693.50 frames. ], batch size: 247, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 10:58:23,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1319292.0, ans=0.125 2023-06-25 10:59:02,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1319412.0, ans=0.1 2023-06-25 10:59:17,490 INFO [train.py:996] (2/4) Epoch 8, batch 6450, loss[loss=0.2019, simple_loss=0.2988, pruned_loss=0.05254, over 21679.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3055, pruned_loss=0.07143, over 4276835.09 frames. ], batch size: 247, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 10:59:58,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1319592.0, ans=0.125 2023-06-25 11:00:25,802 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.54 vs. 
limit=10.0 2023-06-25 11:00:42,536 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.679e+02 3.948e+02 4.858e+02 6.624e+02 1.248e+03, threshold=9.716e+02, percent-clipped=3.0 2023-06-25 11:01:06,965 INFO [train.py:996] (2/4) Epoch 8, batch 6500, loss[loss=0.2662, simple_loss=0.3353, pruned_loss=0.0985, over 21378.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.3, pruned_loss=0.06983, over 4274280.21 frames. ], batch size: 507, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:01:27,677 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.29 vs. limit=15.0 2023-06-25 11:01:43,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1319892.0, ans=0.2 2023-06-25 11:02:16,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1319952.0, ans=0.0 2023-06-25 11:02:37,573 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=15.0 2023-06-25 11:02:57,211 INFO [train.py:996] (2/4) Epoch 8, batch 6550, loss[loss=0.2181, simple_loss=0.2938, pruned_loss=0.07121, over 21868.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2991, pruned_loss=0.06866, over 4272553.75 frames. ], batch size: 351, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:03:16,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1320132.0, ans=0.0 2023-06-25 11:03:20,692 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=15.0 2023-06-25 11:03:34,406 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.07 vs. limit=15.0 2023-06-25 11:03:45,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1320192.0, ans=0.125 2023-06-25 11:04:22,988 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.651e+02 3.584e+02 5.538e+02 7.556e+02 1.701e+03, threshold=1.108e+03, percent-clipped=14.0 2023-06-25 11:04:46,611 INFO [train.py:996] (2/4) Epoch 8, batch 6600, loss[loss=0.1803, simple_loss=0.2442, pruned_loss=0.0582, over 21794.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.293, pruned_loss=0.06857, over 4269704.76 frames. ], batch size: 283, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 11:05:09,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1320432.0, ans=0.2 2023-06-25 11:05:16,089 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=22.5 2023-06-25 11:05:21,610 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.18 vs. limit=22.5 2023-06-25 11:05:33,504 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.76 vs. 
limit=10.0 2023-06-25 11:06:08,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1320552.0, ans=0.2 2023-06-25 11:06:15,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1320612.0, ans=0.1 2023-06-25 11:06:36,618 INFO [train.py:996] (2/4) Epoch 8, batch 6650, loss[loss=0.1685, simple_loss=0.2464, pruned_loss=0.04532, over 21535.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2849, pruned_loss=0.06652, over 4267204.65 frames. ], batch size: 230, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 11:06:54,970 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=15.0 2023-06-25 11:06:57,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1320732.0, ans=0.125 2023-06-25 11:07:15,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1320792.0, ans=0.1 2023-06-25 11:07:41,753 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0 2023-06-25 11:08:03,791 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.500e+02 3.315e+02 4.377e+02 5.902e+02 1.210e+03, threshold=8.754e+02, percent-clipped=3.0 2023-06-25 11:08:26,394 INFO [train.py:996] (2/4) Epoch 8, batch 6700, loss[loss=0.172, simple_loss=0.2221, pruned_loss=0.06098, over 20784.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2784, pruned_loss=0.06572, over 4267733.08 frames. ], batch size: 608, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 11:08:31,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1320972.0, ans=0.125 2023-06-25 11:08:31,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1320972.0, ans=0.125 2023-06-25 11:09:22,476 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.74 vs. limit=15.0 2023-06-25 11:09:34,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1321152.0, ans=0.95 2023-06-25 11:09:38,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1321152.0, ans=0.2 2023-06-25 11:09:53,898 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.13 vs. limit=5.0 2023-06-25 11:10:10,088 INFO [train.py:996] (2/4) Epoch 8, batch 6750, loss[loss=0.1952, simple_loss=0.2763, pruned_loss=0.05707, over 16250.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2767, pruned_loss=0.06564, over 4259691.79 frames. 
], batch size: 61, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 11:10:59,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1321392.0, ans=0.125 2023-06-25 11:10:59,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1321392.0, ans=0.0 2023-06-25 11:11:14,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.whiten.whitening_limit, batch_count=1321392.0, ans=12.0 2023-06-25 11:11:35,907 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.658e+02 3.451e+02 4.455e+02 6.236e+02 1.487e+03, threshold=8.910e+02, percent-clipped=11.0 2023-06-25 11:11:58,609 INFO [train.py:996] (2/4) Epoch 8, batch 6800, loss[loss=0.1866, simple_loss=0.2538, pruned_loss=0.05974, over 21375.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2783, pruned_loss=0.06764, over 4271042.22 frames. ], batch size: 194, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:12:01,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1321572.0, ans=0.0 2023-06-25 11:12:07,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1321572.0, ans=0.125 2023-06-25 11:13:41,737 INFO [train.py:996] (2/4) Epoch 8, batch 6850, loss[loss=0.2284, simple_loss=0.2865, pruned_loss=0.08511, over 21667.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2776, pruned_loss=0.069, over 4279045.29 frames. ], batch size: 441, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:14:00,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1321932.0, ans=15.0 2023-06-25 11:14:17,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1321992.0, ans=0.1 2023-06-25 11:15:06,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1322052.0, ans=0.125 2023-06-25 11:15:09,322 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.681e+02 3.753e+02 5.063e+02 7.364e+02 1.523e+03, threshold=1.013e+03, percent-clipped=16.0 2023-06-25 11:15:27,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1322112.0, ans=0.0 2023-06-25 11:15:32,242 INFO [train.py:996] (2/4) Epoch 8, batch 6900, loss[loss=0.2266, simple_loss=0.2928, pruned_loss=0.08018, over 21790.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.28, pruned_loss=0.06954, over 4286553.35 frames. 
], batch size: 391, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:15:52,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1322232.0, ans=0.1 2023-06-25 11:16:00,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1322232.0, ans=0.1 2023-06-25 11:16:35,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1322292.0, ans=0.0 2023-06-25 11:16:41,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1322352.0, ans=0.04949747468305833 2023-06-25 11:17:05,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1322412.0, ans=0.125 2023-06-25 11:17:23,580 INFO [train.py:996] (2/4) Epoch 8, batch 6950, loss[loss=0.2411, simple_loss=0.3137, pruned_loss=0.08426, over 21296.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.282, pruned_loss=0.06639, over 4275791.94 frames. ], batch size: 159, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:17:31,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1322472.0, ans=0.125 2023-06-25 11:18:22,766 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.02 vs. limit=15.0 2023-06-25 11:18:25,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1322592.0, ans=0.1 2023-06-25 11:18:35,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1322652.0, ans=0.0 2023-06-25 11:18:51,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1322652.0, ans=0.0 2023-06-25 11:18:54,431 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.354e+02 3.544e+02 4.966e+02 6.681e+02 1.694e+03, threshold=9.931e+02, percent-clipped=7.0 2023-06-25 11:19:12,235 INFO [train.py:996] (2/4) Epoch 8, batch 7000, loss[loss=0.1881, simple_loss=0.2551, pruned_loss=0.0605, over 21604.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2845, pruned_loss=0.06832, over 4281019.96 frames. ], batch size: 247, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:19:38,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1322832.0, ans=0.0 2023-06-25 11:20:32,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1322952.0, ans=0.125 2023-06-25 11:20:41,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1323012.0, ans=0.125 2023-06-25 11:20:56,842 INFO [train.py:996] (2/4) Epoch 8, batch 7050, loss[loss=0.1976, simple_loss=0.2858, pruned_loss=0.05471, over 21744.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2825, pruned_loss=0.06759, over 4281996.71 frames. 
], batch size: 351, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:21:00,936 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 11:21:13,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1323072.0, ans=0.1 2023-06-25 11:21:33,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1323132.0, ans=0.0 2023-06-25 11:21:53,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1323192.0, ans=0.125 2023-06-25 11:22:18,073 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.38 vs. limit=10.0 2023-06-25 11:22:18,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1323252.0, ans=0.125 2023-06-25 11:22:30,814 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.752e+02 3.663e+02 4.659e+02 6.225e+02 9.950e+02, threshold=9.319e+02, percent-clipped=1.0 2023-06-25 11:22:35,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1323312.0, ans=0.125 2023-06-25 11:22:48,494 INFO [train.py:996] (2/4) Epoch 8, batch 7100, loss[loss=0.2254, simple_loss=0.3098, pruned_loss=0.07048, over 21584.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2867, pruned_loss=0.06792, over 4275309.63 frames. ], batch size: 389, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:23:21,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1323432.0, ans=0.0 2023-06-25 11:23:42,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1323492.0, ans=0.1 2023-06-25 11:23:44,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1323492.0, ans=0.0 2023-06-25 11:23:44,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1323492.0, ans=0.0 2023-06-25 11:24:44,521 INFO [train.py:996] (2/4) Epoch 8, batch 7150, loss[loss=0.2326, simple_loss=0.3044, pruned_loss=0.08044, over 21408.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2852, pruned_loss=0.06631, over 4266209.95 frames. 
], batch size: 194, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:25:24,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1323792.0, ans=0.1 2023-06-25 11:25:52,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1323852.0, ans=0.0 2023-06-25 11:26:11,774 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.272e+02 3.580e+02 4.514e+02 6.175e+02 1.199e+03, threshold=9.027e+02, percent-clipped=4.0 2023-06-25 11:26:14,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1323912.0, ans=0.1 2023-06-25 11:26:25,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1323912.0, ans=0.125 2023-06-25 11:26:34,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1323912.0, ans=0.125 2023-06-25 11:26:36,764 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-25 11:26:40,812 INFO [train.py:996] (2/4) Epoch 8, batch 7200, loss[loss=0.2022, simple_loss=0.2697, pruned_loss=0.06731, over 21816.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2876, pruned_loss=0.06851, over 4268858.01 frames. ], batch size: 352, lr: 3.80e-03, grad_scale: 32.0 2023-06-25 11:27:06,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1324032.0, ans=0.1 2023-06-25 11:27:12,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1324032.0, ans=0.1 2023-06-25 11:27:39,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1324152.0, ans=0.125 2023-06-25 11:27:57,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1324152.0, ans=0.125 2023-06-25 11:28:02,056 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.39 vs. limit=12.0 2023-06-25 11:28:28,901 INFO [train.py:996] (2/4) Epoch 8, batch 7250, loss[loss=0.2, simple_loss=0.266, pruned_loss=0.067, over 21829.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2836, pruned_loss=0.06858, over 4270679.63 frames. ], batch size: 352, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:28:46,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1324332.0, ans=0.1 2023-06-25 11:29:14,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1324392.0, ans=0.0 2023-06-25 11:29:46,634 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 11:29:47,159 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.87 vs. 
limit=15.0 2023-06-25 11:29:51,147 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.614e+02 3.605e+02 4.552e+02 6.343e+02 1.382e+03, threshold=9.103e+02, percent-clipped=6.0 2023-06-25 11:30:11,423 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.23 vs. limit=15.0 2023-06-25 11:30:17,030 INFO [train.py:996] (2/4) Epoch 8, batch 7300, loss[loss=0.1785, simple_loss=0.2453, pruned_loss=0.05585, over 21635.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2794, pruned_loss=0.06788, over 4263426.21 frames. ], batch size: 264, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:30:21,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1324572.0, ans=0.125 2023-06-25 11:30:37,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1324632.0, ans=0.125 2023-06-25 11:30:46,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1324632.0, ans=0.0 2023-06-25 11:30:53,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1324632.0, ans=0.0 2023-06-25 11:31:19,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1324752.0, ans=0.1 2023-06-25 11:31:41,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1324812.0, ans=0.0 2023-06-25 11:32:07,771 INFO [train.py:996] (2/4) Epoch 8, batch 7350, loss[loss=0.2361, simple_loss=0.3024, pruned_loss=0.08492, over 21577.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2783, pruned_loss=0.06861, over 4264666.38 frames. ], batch size: 415, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:32:14,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1324872.0, ans=0.125 2023-06-25 11:32:44,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1324932.0, ans=0.2 2023-06-25 11:32:51,108 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.49 vs. limit=22.5 2023-06-25 11:33:08,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1325052.0, ans=0.1 2023-06-25 11:33:44,455 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.590e+02 4.058e+02 5.630e+02 9.164e+02 1.929e+03, threshold=1.126e+03, percent-clipped=26.0 2023-06-25 11:33:57,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1325112.0, ans=0.2 2023-06-25 11:34:01,236 INFO [train.py:996] (2/4) Epoch 8, batch 7400, loss[loss=0.2547, simple_loss=0.3176, pruned_loss=0.09586, over 21750.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2843, pruned_loss=0.07118, over 4265852.80 frames. 
], batch size: 441, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:34:29,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1325232.0, ans=0.1 2023-06-25 11:35:06,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1325352.0, ans=0.2 2023-06-25 11:35:51,174 INFO [train.py:996] (2/4) Epoch 8, batch 7450, loss[loss=0.2778, simple_loss=0.3204, pruned_loss=0.1176, over 21427.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2825, pruned_loss=0.07004, over 4270127.89 frames. ], batch size: 509, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:36:27,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1325532.0, ans=0.0 2023-06-25 11:36:39,747 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-06-25 11:37:06,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1325652.0, ans=0.125 2023-06-25 11:37:23,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1325652.0, ans=0.0 2023-06-25 11:37:28,597 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.598e+02 3.413e+02 4.464e+02 6.199e+02 1.662e+03, threshold=8.927e+02, percent-clipped=2.0 2023-06-25 11:37:50,179 INFO [train.py:996] (2/4) Epoch 8, batch 7500, loss[loss=0.3071, simple_loss=0.399, pruned_loss=0.1076, over 21482.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2867, pruned_loss=0.07165, over 4265922.12 frames. ], batch size: 471, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:38:48,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1325892.0, ans=0.1 2023-06-25 11:38:58,259 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=15.0 2023-06-25 11:39:37,770 INFO [train.py:996] (2/4) Epoch 8, batch 7550, loss[loss=0.2281, simple_loss=0.3227, pruned_loss=0.06674, over 21624.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2961, pruned_loss=0.07177, over 4273179.07 frames. ], batch size: 414, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:39:53,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1326132.0, ans=0.125 2023-06-25 11:39:54,488 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.19 vs. limit=10.0 2023-06-25 11:40:22,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1326192.0, ans=0.125 2023-06-25 11:40:47,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1326252.0, ans=10.0 2023-06-25 11:40:48,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1326252.0, ans=0.125 2023-06-25 11:41:03,726 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.92 vs. 
limit=15.0 2023-06-25 11:41:05,663 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.417e+02 3.672e+02 5.210e+02 9.088e+02 2.173e+03, threshold=1.042e+03, percent-clipped=24.0 2023-06-25 11:41:14,285 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.85 vs. limit=12.0 2023-06-25 11:41:26,458 INFO [train.py:996] (2/4) Epoch 8, batch 7600, loss[loss=0.248, simple_loss=0.3052, pruned_loss=0.09545, over 21647.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2975, pruned_loss=0.07112, over 4273950.58 frames. ], batch size: 471, lr: 3.79e-03, grad_scale: 32.0 2023-06-25 11:41:27,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1326372.0, ans=0.125 2023-06-25 11:41:55,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1326432.0, ans=0.125 2023-06-25 11:42:38,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1326552.0, ans=0.125 2023-06-25 11:42:47,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=1326612.0, ans=0.2 2023-06-25 11:42:52,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1326612.0, ans=0.125 2023-06-25 11:43:09,715 INFO [train.py:996] (2/4) Epoch 8, batch 7650, loss[loss=0.2321, simple_loss=0.3012, pruned_loss=0.08151, over 21295.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2971, pruned_loss=0.07258, over 4281419.13 frames. ], batch size: 143, lr: 3.79e-03, grad_scale: 32.0 2023-06-25 11:44:10,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1326792.0, ans=0.0 2023-06-25 11:44:37,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1326912.0, ans=0.2 2023-06-25 11:44:44,980 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.753e+02 3.604e+02 4.352e+02 5.552e+02 1.331e+03, threshold=8.705e+02, percent-clipped=4.0 2023-06-25 11:44:48,081 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.46 vs. limit=15.0 2023-06-25 11:44:59,565 INFO [train.py:996] (2/4) Epoch 8, batch 7700, loss[loss=0.2194, simple_loss=0.3086, pruned_loss=0.06514, over 21618.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2988, pruned_loss=0.07484, over 4288271.64 frames. ], batch size: 230, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:45:18,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1327032.0, ans=0.0 2023-06-25 11:45:30,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1327032.0, ans=15.0 2023-06-25 11:46:46,203 INFO [train.py:996] (2/4) Epoch 8, batch 7750, loss[loss=0.2613, simple_loss=0.3645, pruned_loss=0.07911, over 21876.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3047, pruned_loss=0.07578, over 4281277.77 frames. 
], batch size: 372, lr: 3.79e-03, grad_scale: 8.0 2023-06-25 11:47:01,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1327272.0, ans=0.125 2023-06-25 11:47:51,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1327392.0, ans=0.0 2023-06-25 11:48:13,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1327452.0, ans=0.2 2023-06-25 11:48:24,329 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.42 vs. limit=12.0 2023-06-25 11:48:24,878 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.807e+02 4.127e+02 5.917e+02 8.235e+02 1.345e+03, threshold=1.183e+03, percent-clipped=19.0 2023-06-25 11:48:37,392 INFO [train.py:996] (2/4) Epoch 8, batch 7800, loss[loss=0.2117, simple_loss=0.2892, pruned_loss=0.06705, over 21753.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3055, pruned_loss=0.0758, over 4272527.03 frames. ], batch size: 332, lr: 3.79e-03, grad_scale: 8.0 2023-06-25 11:48:49,404 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=12.0 2023-06-25 11:49:52,735 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-25 11:50:09,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1327812.0, ans=10.0 2023-06-25 11:50:26,499 INFO [train.py:996] (2/4) Epoch 8, batch 7850, loss[loss=0.2098, simple_loss=0.2739, pruned_loss=0.0728, over 21448.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2991, pruned_loss=0.07421, over 4271338.89 frames. ], batch size: 389, lr: 3.79e-03, grad_scale: 8.0 2023-06-25 11:51:54,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1328052.0, ans=0.125 2023-06-25 11:52:07,056 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.783e+02 3.555e+02 5.085e+02 7.464e+02 1.705e+03, threshold=1.017e+03, percent-clipped=5.0 2023-06-25 11:52:26,567 INFO [train.py:996] (2/4) Epoch 8, batch 7900, loss[loss=0.22, simple_loss=0.3157, pruned_loss=0.06209, over 21752.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2926, pruned_loss=0.07251, over 4259175.29 frames. ], batch size: 332, lr: 3.79e-03, grad_scale: 8.0 2023-06-25 11:53:06,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1328232.0, ans=0.125 2023-06-25 11:54:24,573 INFO [train.py:996] (2/4) Epoch 8, batch 7950, loss[loss=0.1773, simple_loss=0.2287, pruned_loss=0.06297, over 20710.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.297, pruned_loss=0.07195, over 4262902.71 frames. ], batch size: 609, lr: 3.79e-03, grad_scale: 8.0 2023-06-25 11:54:27,616 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.48 vs. 
limit=15.0 2023-06-25 11:54:46,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1328532.0, ans=0.0 2023-06-25 11:55:42,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1328652.0, ans=0.125 2023-06-25 11:56:11,369 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.729e+02 4.611e+02 6.417e+02 9.938e+02 3.239e+03, threshold=1.283e+03, percent-clipped=22.0 2023-06-25 11:56:15,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1328712.0, ans=0.125 2023-06-25 11:56:17,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1328712.0, ans=0.125 2023-06-25 11:56:23,241 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=15.0 2023-06-25 11:56:24,108 INFO [train.py:996] (2/4) Epoch 8, batch 8000, loss[loss=0.2894, simple_loss=0.3708, pruned_loss=0.1041, over 21467.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2997, pruned_loss=0.07395, over 4261630.96 frames. ], batch size: 471, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:56:24,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1328772.0, ans=0.125 2023-06-25 11:58:24,050 INFO [train.py:996] (2/4) Epoch 8, batch 8050, loss[loss=0.2128, simple_loss=0.2758, pruned_loss=0.07492, over 21082.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3029, pruned_loss=0.07489, over 4264861.19 frames. ], batch size: 159, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:58:33,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=1329072.0, ans=0.1 2023-06-25 11:58:44,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1329132.0, ans=0.1 2023-06-25 11:59:43,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1329252.0, ans=0.125 2023-06-25 12:00:03,820 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.754e+02 4.648e+02 6.798e+02 1.163e+03 2.924e+03, threshold=1.360e+03, percent-clipped=20.0 2023-06-25 12:00:04,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1329312.0, ans=0.2 2023-06-25 12:00:16,727 INFO [train.py:996] (2/4) Epoch 8, batch 8100, loss[loss=0.2097, simple_loss=0.2815, pruned_loss=0.06894, over 21906.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3037, pruned_loss=0.07499, over 4267316.04 frames. 
], batch size: 316, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 12:01:15,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1329492.0, ans=0.125 2023-06-25 12:01:18,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1329492.0, ans=0.125 2023-06-25 12:01:42,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1329552.0, ans=0.125 2023-06-25 12:01:52,707 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.72 vs. limit=15.0 2023-06-25 12:02:15,827 INFO [train.py:996] (2/4) Epoch 8, batch 8150, loss[loss=0.2782, simple_loss=0.3835, pruned_loss=0.08648, over 21567.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3117, pruned_loss=0.07603, over 4268260.23 frames. ], batch size: 441, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 12:02:20,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1329672.0, ans=0.125 2023-06-25 12:02:31,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1329732.0, ans=0.0 2023-06-25 12:03:01,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1329792.0, ans=0.0 2023-06-25 12:03:47,515 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.902e+02 4.317e+02 6.289e+02 1.033e+03 2.172e+03, threshold=1.258e+03, percent-clipped=12.0 2023-06-25 12:04:04,950 INFO [train.py:996] (2/4) Epoch 8, batch 8200, loss[loss=0.1672, simple_loss=0.2299, pruned_loss=0.05223, over 16476.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3028, pruned_loss=0.0733, over 4262379.18 frames. ], batch size: 63, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 12:04:50,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1330092.0, ans=0.125 2023-06-25 12:05:01,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1330092.0, ans=0.125 2023-06-25 12:05:14,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1330152.0, ans=0.0 2023-06-25 12:05:49,021 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=15.0 2023-06-25 12:05:54,982 INFO [train.py:996] (2/4) Epoch 8, batch 8250, loss[loss=0.228, simple_loss=0.3163, pruned_loss=0.06985, over 21579.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3022, pruned_loss=0.07328, over 4269438.61 frames. ], batch size: 230, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 12:06:33,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1330332.0, ans=0.125 2023-06-25 12:06:38,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1330332.0, ans=0.125 2023-06-25 12:07:10,996 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.42 vs. 
limit=22.5 2023-06-25 12:07:27,234 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.445e+02 3.428e+02 4.253e+02 6.741e+02 1.234e+03, threshold=8.505e+02, percent-clipped=0.0 2023-06-25 12:07:50,390 INFO [train.py:996] (2/4) Epoch 8, batch 8300, loss[loss=0.181, simple_loss=0.2672, pruned_loss=0.04743, over 21634.00 frames. ], tot_loss[loss=0.221, simple_loss=0.3001, pruned_loss=0.07096, over 4270052.24 frames. ], batch size: 230, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 12:08:34,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1330692.0, ans=0.1 2023-06-25 12:08:50,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1330692.0, ans=0.2 2023-06-25 12:09:04,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1330752.0, ans=0.125 2023-06-25 12:09:10,425 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.20 vs. limit=15.0 2023-06-25 12:09:38,999 INFO [train.py:996] (2/4) Epoch 8, batch 8350, loss[loss=0.261, simple_loss=0.3894, pruned_loss=0.06633, over 20766.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.3012, pruned_loss=0.07023, over 4267823.45 frames. ], batch size: 607, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 12:10:33,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1330992.0, ans=0.125 2023-06-25 12:10:34,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1330992.0, ans=0.125 2023-06-25 12:10:58,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1331112.0, ans=0.125 2023-06-25 12:11:10,350 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.535e+02 3.482e+02 5.019e+02 7.188e+02 1.647e+03, threshold=1.004e+03, percent-clipped=15.0 2023-06-25 12:11:27,233 INFO [train.py:996] (2/4) Epoch 8, batch 8400, loss[loss=0.1694, simple_loss=0.2694, pruned_loss=0.03473, over 21708.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2982, pruned_loss=0.06738, over 4266186.09 frames. ], batch size: 332, lr: 3.79e-03, grad_scale: 32.0 2023-06-25 12:11:52,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1331232.0, ans=0.1 2023-06-25 12:12:03,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1331232.0, ans=0.0 2023-06-25 12:12:57,975 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.28 vs. limit=12.0 2023-06-25 12:13:15,397 INFO [train.py:996] (2/4) Epoch 8, batch 8450, loss[loss=0.1954, simple_loss=0.2775, pruned_loss=0.0566, over 21833.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2942, pruned_loss=0.06692, over 4266642.75 frames. ], batch size: 124, lr: 3.79e-03, grad_scale: 32.0 2023-06-25 12:13:21,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1331472.0, ans=0.04949747468305833 2023-06-25 12:13:43,860 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.63 vs. 
limit=12.0 2023-06-25 12:14:04,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1331592.0, ans=0.0 2023-06-25 12:14:45,596 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.627e+02 3.841e+02 5.103e+02 7.112e+02 1.474e+03, threshold=1.021e+03, percent-clipped=11.0 2023-06-25 12:15:04,513 INFO [train.py:996] (2/4) Epoch 8, batch 8500, loss[loss=0.1925, simple_loss=0.2617, pruned_loss=0.06166, over 21656.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2903, pruned_loss=0.06811, over 4266091.98 frames. ], batch size: 247, lr: 3.79e-03, grad_scale: 32.0 2023-06-25 12:16:08,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1331952.0, ans=0.125 2023-06-25 12:16:26,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1332012.0, ans=0.125 2023-06-25 12:16:56,643 INFO [train.py:996] (2/4) Epoch 8, batch 8550, loss[loss=0.2479, simple_loss=0.3432, pruned_loss=0.07636, over 21849.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2944, pruned_loss=0.06945, over 4256590.00 frames. ], batch size: 371, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:17:15,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1332072.0, ans=0.125 2023-06-25 12:18:06,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1332252.0, ans=0.5 2023-06-25 12:18:15,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1332252.0, ans=10.0 2023-06-25 12:18:16,094 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=12.0 2023-06-25 12:18:36,647 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.783e+02 4.136e+02 5.316e+02 7.631e+02 1.468e+03, threshold=1.063e+03, percent-clipped=11.0 2023-06-25 12:18:52,716 INFO [train.py:996] (2/4) Epoch 8, batch 8600, loss[loss=0.2275, simple_loss=0.312, pruned_loss=0.07146, over 21568.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.3005, pruned_loss=0.07142, over 4258295.52 frames. ], batch size: 389, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:18:55,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1332372.0, ans=0.0 2023-06-25 12:19:09,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1332432.0, ans=0.1 2023-06-25 12:20:40,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1332612.0, ans=0.125 2023-06-25 12:20:43,309 INFO [train.py:996] (2/4) Epoch 8, batch 8650, loss[loss=0.1893, simple_loss=0.2727, pruned_loss=0.05299, over 21845.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3056, pruned_loss=0.0714, over 4268676.97 frames. ], batch size: 107, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:20:44,640 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.02 vs. 
limit=22.5 2023-06-25 12:20:46,262 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.05 vs. limit=15.0 2023-06-25 12:20:50,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1332672.0, ans=0.1 2023-06-25 12:21:25,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1332792.0, ans=0.125 2023-06-25 12:22:16,165 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.514e+02 3.872e+02 5.286e+02 7.583e+02 1.337e+03, threshold=1.057e+03, percent-clipped=5.0 2023-06-25 12:22:32,438 INFO [train.py:996] (2/4) Epoch 8, batch 8700, loss[loss=0.2521, simple_loss=0.3669, pruned_loss=0.06865, over 19921.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2991, pruned_loss=0.06864, over 4257919.46 frames. ], batch size: 702, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:22:38,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1332972.0, ans=0.0 2023-06-25 12:22:38,831 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.71 vs. limit=22.5 2023-06-25 12:23:24,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1333092.0, ans=0.04949747468305833 2023-06-25 12:23:46,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1333152.0, ans=0.0 2023-06-25 12:23:56,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1333152.0, ans=0.125 2023-06-25 12:24:21,815 INFO [train.py:996] (2/4) Epoch 8, batch 8750, loss[loss=0.2043, simple_loss=0.2774, pruned_loss=0.06561, over 21989.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2965, pruned_loss=0.06927, over 4264900.60 frames. ], batch size: 103, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:24:31,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1333272.0, ans=0.04949747468305833 2023-06-25 12:24:52,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1333332.0, ans=0.125 2023-06-25 12:24:54,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1333332.0, ans=0.0 2023-06-25 12:25:26,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1333452.0, ans=10.0 2023-06-25 12:26:02,041 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.799e+02 3.947e+02 5.629e+02 7.790e+02 1.713e+03, threshold=1.126e+03, percent-clipped=18.0 2023-06-25 12:26:09,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1333512.0, ans=0.125 2023-06-25 12:26:18,071 INFO [train.py:996] (2/4) Epoch 8, batch 8800, loss[loss=0.2625, simple_loss=0.3424, pruned_loss=0.0913, over 21282.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3063, pruned_loss=0.0727, over 4268384.80 frames. 
], batch size: 143, lr: 3.78e-03, grad_scale: 32.0 2023-06-25 12:26:29,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1333572.0, ans=0.0 2023-06-25 12:26:30,412 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.52 vs. limit=15.0 2023-06-25 12:26:40,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1333632.0, ans=0.125 2023-06-25 12:26:55,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1333632.0, ans=0.2 2023-06-25 12:27:02,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1333692.0, ans=0.125 2023-06-25 12:27:41,772 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.89 vs. limit=15.0 2023-06-25 12:28:09,036 INFO [train.py:996] (2/4) Epoch 8, batch 8850, loss[loss=0.2023, simple_loss=0.286, pruned_loss=0.05927, over 21556.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3106, pruned_loss=0.07425, over 4266540.46 frames. ], batch size: 263, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:28:20,732 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:28:24,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1333872.0, ans=0.0 2023-06-25 12:28:44,354 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.40 vs. limit=10.0 2023-06-25 12:29:11,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1333992.0, ans=0.125 2023-06-25 12:29:14,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1334052.0, ans=0.1 2023-06-25 12:29:51,861 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.618e+02 3.568e+02 4.882e+02 6.738e+02 2.080e+03, threshold=9.764e+02, percent-clipped=3.0 2023-06-25 12:29:52,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1334112.0, ans=0.125 2023-06-25 12:30:01,483 INFO [train.py:996] (2/4) Epoch 8, batch 8900, loss[loss=0.2082, simple_loss=0.3172, pruned_loss=0.04956, over 21194.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3042, pruned_loss=0.07336, over 4264859.51 frames. ], batch size: 549, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:30:17,656 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.87 vs. 
limit=6.0 2023-06-25 12:30:18,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1334172.0, ans=0.125 2023-06-25 12:30:18,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1334172.0, ans=0.1 2023-06-25 12:30:33,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1334232.0, ans=0.0 2023-06-25 12:31:29,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1334352.0, ans=0.2 2023-06-25 12:31:59,190 INFO [train.py:996] (2/4) Epoch 8, batch 8950, loss[loss=0.2469, simple_loss=0.3215, pruned_loss=0.08618, over 21595.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3073, pruned_loss=0.07307, over 4262454.83 frames. ], batch size: 414, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:32:47,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1334592.0, ans=0.125 2023-06-25 12:33:13,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1334652.0, ans=0.2 2023-06-25 12:33:18,920 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:33:25,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1334712.0, ans=0.125 2023-06-25 12:33:34,162 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.718e+02 4.076e+02 6.080e+02 7.762e+02 1.933e+03, threshold=1.216e+03, percent-clipped=14.0 2023-06-25 12:33:48,744 INFO [train.py:996] (2/4) Epoch 8, batch 9000, loss[loss=0.1999, simple_loss=0.2805, pruned_loss=0.05967, over 21782.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3009, pruned_loss=0.0721, over 4262931.53 frames. ], batch size: 317, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:33:48,744 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 12:34:04,506 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.7772, 3.9347, 3.7328, 3.9653], device='cuda:2') 2023-06-25 12:34:07,165 INFO [train.py:1028] (2/4) Epoch 8, validation: loss=0.2631, simple_loss=0.3554, pruned_loss=0.08544, over 1796401.00 frames. 2023-06-25 12:34:07,166 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-25 12:34:09,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1334772.0, ans=0.09899494936611666 2023-06-25 12:35:21,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1334952.0, ans=0.05 2023-06-25 12:35:57,424 INFO [train.py:996] (2/4) Epoch 8, batch 9050, loss[loss=0.2734, simple_loss=0.3562, pruned_loss=0.09536, over 21821.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2959, pruned_loss=0.07024, over 4269723.92 frames. 
], batch size: 118, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:36:10,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1335072.0, ans=0.1 2023-06-25 12:36:10,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1335072.0, ans=0.1 2023-06-25 12:36:49,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1335192.0, ans=0.125 2023-06-25 12:37:00,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1335192.0, ans=0.125 2023-06-25 12:37:16,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1335252.0, ans=0.125 2023-06-25 12:37:47,101 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.667e+02 3.976e+02 5.366e+02 7.574e+02 1.688e+03, threshold=1.073e+03, percent-clipped=5.0 2023-06-25 12:37:55,906 INFO [train.py:996] (2/4) Epoch 8, batch 9100, loss[loss=0.2706, simple_loss=0.3335, pruned_loss=0.1038, over 21363.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3019, pruned_loss=0.07253, over 4268117.90 frames. ], batch size: 507, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:38:46,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1335492.0, ans=0.0 2023-06-25 12:38:47,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1335492.0, ans=0.125 2023-06-25 12:39:20,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1335552.0, ans=0.0 2023-06-25 12:39:39,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1335612.0, ans=0.1 2023-06-25 12:39:47,141 INFO [train.py:996] (2/4) Epoch 8, batch 9150, loss[loss=0.2133, simple_loss=0.2954, pruned_loss=0.06555, over 21238.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3055, pruned_loss=0.07056, over 4271883.38 frames. ], batch size: 176, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:41:16,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1335852.0, ans=0.0 2023-06-25 12:41:27,972 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.600e+02 3.582e+02 4.285e+02 5.759e+02 1.145e+03, threshold=8.570e+02, percent-clipped=4.0 2023-06-25 12:41:47,517 INFO [train.py:996] (2/4) Epoch 8, batch 9200, loss[loss=0.2125, simple_loss=0.2963, pruned_loss=0.06432, over 21302.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.307, pruned_loss=0.0692, over 4268129.33 frames. ], batch size: 159, lr: 3.78e-03, grad_scale: 32.0 2023-06-25 12:42:14,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1336032.0, ans=0.0 2023-06-25 12:43:37,056 INFO [train.py:996] (2/4) Epoch 8, batch 9250, loss[loss=0.2188, simple_loss=0.3294, pruned_loss=0.05411, over 19845.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3078, pruned_loss=0.07133, over 4274654.42 frames. 
], batch size: 702, lr: 3.78e-03, grad_scale: 32.0 2023-06-25 12:43:48,210 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:44:43,256 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.61 vs. limit=6.0 2023-06-25 12:44:45,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1336452.0, ans=15.0 2023-06-25 12:45:21,452 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.801e+02 3.683e+02 5.339e+02 7.868e+02 1.539e+03, threshold=1.068e+03, percent-clipped=20.0 2023-06-25 12:45:22,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1336512.0, ans=0.0 2023-06-25 12:45:28,191 INFO [train.py:996] (2/4) Epoch 8, batch 9300, loss[loss=0.2485, simple_loss=0.3359, pruned_loss=0.08056, over 21719.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.302, pruned_loss=0.07106, over 4264636.88 frames. ], batch size: 332, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:45:41,874 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-25 12:45:48,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1336632.0, ans=0.125 2023-06-25 12:47:16,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1336812.0, ans=0.1 2023-06-25 12:47:19,173 INFO [train.py:996] (2/4) Epoch 8, batch 9350, loss[loss=0.2646, simple_loss=0.3401, pruned_loss=0.09458, over 21786.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3085, pruned_loss=0.07268, over 4274797.02 frames. ], batch size: 441, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:47:39,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1336872.0, ans=0.025 2023-06-25 12:48:34,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1337052.0, ans=0.0 2023-06-25 12:49:00,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1337112.0, ans=0.1 2023-06-25 12:49:02,855 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.027e+02 4.116e+02 5.791e+02 8.209e+02 2.175e+03, threshold=1.158e+03, percent-clipped=13.0 2023-06-25 12:49:10,258 INFO [train.py:996] (2/4) Epoch 8, batch 9400, loss[loss=0.198, simple_loss=0.2598, pruned_loss=0.0681, over 21337.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3098, pruned_loss=0.07283, over 4272952.73 frames. ], batch size: 194, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:49:39,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1337232.0, ans=0.0 2023-06-25 12:50:18,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1337292.0, ans=0.2 2023-06-25 12:51:05,930 INFO [train.py:996] (2/4) Epoch 8, batch 9450, loss[loss=0.2257, simple_loss=0.2782, pruned_loss=0.08662, over 21208.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2996, pruned_loss=0.07186, over 4265726.85 frames. 
], batch size: 471, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:51:16,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1337472.0, ans=0.125 2023-06-25 12:52:07,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1337592.0, ans=10.0 2023-06-25 12:52:13,107 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=15.0 2023-06-25 12:52:18,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1337652.0, ans=0.04949747468305833 2023-06-25 12:52:20,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1337652.0, ans=0.125 2023-06-25 12:52:41,868 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.789e+02 4.276e+02 5.565e+02 7.806e+02 1.820e+03, threshold=1.113e+03, percent-clipped=7.0 2023-06-25 12:52:44,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1337712.0, ans=0.125 2023-06-25 12:52:48,845 INFO [train.py:996] (2/4) Epoch 8, batch 9500, loss[loss=0.2188, simple_loss=0.2939, pruned_loss=0.07184, over 21760.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2935, pruned_loss=0.0706, over 4269409.70 frames. ], batch size: 124, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:53:17,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1337832.0, ans=0.2 2023-06-25 12:53:52,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1337892.0, ans=0.125 2023-06-25 12:54:09,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1337952.0, ans=0.2 2023-06-25 12:54:16,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1337952.0, ans=0.125 2023-06-25 12:54:28,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1338012.0, ans=0.125 2023-06-25 12:54:43,655 INFO [train.py:996] (2/4) Epoch 8, batch 9550, loss[loss=0.2371, simple_loss=0.3155, pruned_loss=0.07931, over 21708.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2974, pruned_loss=0.07153, over 4270146.67 frames. ], batch size: 351, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:56:05,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1338252.0, ans=0.125 2023-06-25 12:56:26,049 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.967e+02 4.048e+02 5.374e+02 8.215e+02 1.903e+03, threshold=1.075e+03, percent-clipped=10.0 2023-06-25 12:56:27,122 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-06-25 12:56:32,894 INFO [train.py:996] (2/4) Epoch 8, batch 9600, loss[loss=0.2607, simple_loss=0.3135, pruned_loss=0.1039, over 21744.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2999, pruned_loss=0.07366, over 4280715.88 frames. 
], batch size: 507, lr: 3.78e-03, grad_scale: 32.0 2023-06-25 12:56:48,113 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=22.5 2023-06-25 12:58:16,389 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.62 vs. limit=22.5 2023-06-25 12:58:24,371 INFO [train.py:996] (2/4) Epoch 8, batch 9650, loss[loss=0.2434, simple_loss=0.3196, pruned_loss=0.08363, over 21461.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3002, pruned_loss=0.07347, over 4279907.64 frames. ], batch size: 211, lr: 3.78e-03, grad_scale: 32.0 2023-06-25 12:58:44,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1338672.0, ans=0.125 2023-06-25 12:58:44,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1338672.0, ans=0.0 2023-06-25 12:59:16,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1338792.0, ans=0.125 2023-06-25 13:00:07,429 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.726e+02 3.684e+02 4.580e+02 6.595e+02 1.807e+03, threshold=9.160e+02, percent-clipped=4.0 2023-06-25 13:00:20,088 INFO [train.py:996] (2/4) Epoch 8, batch 9700, loss[loss=0.2403, simple_loss=0.3519, pruned_loss=0.06435, over 20779.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3029, pruned_loss=0.07314, over 4273525.69 frames. ], batch size: 608, lr: 3.78e-03, grad_scale: 32.0 2023-06-25 13:00:31,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1338972.0, ans=0.125 2023-06-25 13:00:59,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1339032.0, ans=0.04949747468305833 2023-06-25 13:02:02,400 INFO [train.py:996] (2/4) Epoch 8, batch 9750, loss[loss=0.2305, simple_loss=0.275, pruned_loss=0.09298, over 21222.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2977, pruned_loss=0.0725, over 4279710.46 frames. ], batch size: 471, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:02:04,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1339272.0, ans=0.125 2023-06-25 13:02:04,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1339272.0, ans=0.2 2023-06-25 13:03:08,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1339452.0, ans=0.0 2023-06-25 13:03:22,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1339452.0, ans=0.1 2023-06-25 13:03:42,359 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.775e+02 3.743e+02 5.532e+02 7.768e+02 2.224e+03, threshold=1.106e+03, percent-clipped=14.0 2023-06-25 13:03:49,293 INFO [train.py:996] (2/4) Epoch 8, batch 9800, loss[loss=0.2236, simple_loss=0.291, pruned_loss=0.07814, over 21844.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2981, pruned_loss=0.07271, over 4262809.56 frames. 
], batch size: 414, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:04:40,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1339692.0, ans=0.125 2023-06-25 13:05:37,682 INFO [train.py:996] (2/4) Epoch 8, batch 9850, loss[loss=0.2089, simple_loss=0.2793, pruned_loss=0.06923, over 21867.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2941, pruned_loss=0.07283, over 4257924.17 frames. ], batch size: 107, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:06:15,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1339932.0, ans=0.125 2023-06-25 13:06:37,187 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-25 13:06:41,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1339992.0, ans=0.125 2023-06-25 13:06:48,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1340052.0, ans=0.0 2023-06-25 13:07:15,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1340112.0, ans=0.125 2023-06-25 13:07:19,912 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.843e+02 3.728e+02 4.692e+02 6.683e+02 1.521e+03, threshold=9.384e+02, percent-clipped=6.0 2023-06-25 13:07:26,609 INFO [train.py:996] (2/4) Epoch 8, batch 9900, loss[loss=0.1945, simple_loss=0.2635, pruned_loss=0.06279, over 21883.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.29, pruned_loss=0.07212, over 4261244.18 frames. ], batch size: 373, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:07:55,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1340232.0, ans=0.1 2023-06-25 13:08:02,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1340232.0, ans=0.1 2023-06-25 13:08:23,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1340292.0, ans=0.0 2023-06-25 13:08:58,491 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=12.0 2023-06-25 13:09:14,618 INFO [train.py:996] (2/4) Epoch 8, batch 9950, loss[loss=0.2342, simple_loss=0.2935, pruned_loss=0.08743, over 21570.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2937, pruned_loss=0.07441, over 4262147.34 frames. ], batch size: 415, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:09:55,543 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.88 vs. limit=5.0 2023-06-25 13:10:19,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1340592.0, ans=0.0 2023-06-25 13:10:33,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1340652.0, ans=0.125 2023-06-25 13:10:54,606 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.97 vs. 
limit=22.5 2023-06-25 13:10:59,870 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.606e+02 3.700e+02 4.924e+02 7.179e+02 1.701e+03, threshold=9.849e+02, percent-clipped=16.0 2023-06-25 13:11:02,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1340712.0, ans=0.2 2023-06-25 13:11:11,620 INFO [train.py:996] (2/4) Epoch 8, batch 10000, loss[loss=0.22, simple_loss=0.2892, pruned_loss=0.07543, over 21255.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2891, pruned_loss=0.073, over 4253919.69 frames. ], batch size: 159, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:11:19,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1340772.0, ans=0.2 2023-06-25 13:12:11,568 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.13 vs. limit=15.0 2023-06-25 13:13:02,332 INFO [train.py:996] (2/4) Epoch 8, batch 10050, loss[loss=0.1968, simple_loss=0.2727, pruned_loss=0.06045, over 21644.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2927, pruned_loss=0.07414, over 4260559.73 frames. ], batch size: 415, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:13:22,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1341072.0, ans=0.125 2023-06-25 13:13:23,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1341132.0, ans=0.0 2023-06-25 13:13:34,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=1341132.0, ans=0.02 2023-06-25 13:13:56,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1341192.0, ans=0.125 2023-06-25 13:14:18,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1341252.0, ans=0.125 2023-06-25 13:14:40,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1341312.0, ans=0.0 2023-06-25 13:14:55,227 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.655e+02 4.346e+02 5.951e+02 7.848e+02 1.633e+03, threshold=1.190e+03, percent-clipped=16.0 2023-06-25 13:14:58,753 INFO [train.py:996] (2/4) Epoch 8, batch 10100, loss[loss=0.2326, simple_loss=0.3243, pruned_loss=0.07044, over 21666.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.29, pruned_loss=0.07155, over 4251802.53 frames. 
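
The optim.py lines above print five grad-norm statistics (presumably min, quartiles and max over recent batches), a clipping threshold and the share of recent batches that were clipped; throughout this excerpt the threshold sits at about Clipping_scale=2.0 times the middle value. Below is a sketch of that kind of quantile-based clipping; the exact rule inside the optimizer is an inference here, and the helper name is hypothetical:

import torch

def clip_to_recent_median(grad_norm, history, clipping_scale=2.0, window=1000):
    # keep a sliding window of recent gradient norms
    history.append(float(grad_norm))
    del history[:-window]
    quartiles = torch.quantile(torch.tensor(history),
                               torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * quartiles[2].item()          # e.g. twice the median
    clip_scale = min(1.0, threshold / (float(grad_norm) + 1e-20))
    percent_clipped = 100.0 * sum(n > threshold for n in history) / len(history)
    return clip_scale, quartiles.tolist(), threshold, percent_clipped

history = [261.0, 370.0, 492.0, 718.0]
print(clip_to_recent_median(1701.0, history))  # the outlier batch gets scaled down
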
], batch size: 441, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:15:01,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1341372.0, ans=0.0 2023-06-25 13:15:21,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1341432.0, ans=0.04949747468305833 2023-06-25 13:15:23,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1341432.0, ans=0.1 2023-06-25 13:15:23,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1341432.0, ans=0.95 2023-06-25 13:15:36,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1341432.0, ans=0.025 2023-06-25 13:15:52,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1341492.0, ans=0.5 2023-06-25 13:15:52,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1341492.0, ans=0.125 2023-06-25 13:16:01,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1341552.0, ans=0.1 2023-06-25 13:16:45,494 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:16:48,270 INFO [train.py:996] (2/4) Epoch 8, batch 10150, loss[loss=0.2032, simple_loss=0.2697, pruned_loss=0.06841, over 21210.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2965, pruned_loss=0.07376, over 4264562.15 frames. ], batch size: 608, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:17:16,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1341732.0, ans=0.1 2023-06-25 13:18:38,996 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.326e+02 3.458e+02 4.384e+02 5.388e+02 1.096e+03, threshold=8.768e+02, percent-clipped=0.0 2023-06-25 13:18:42,757 INFO [train.py:996] (2/4) Epoch 8, batch 10200, loss[loss=0.1854, simple_loss=0.2698, pruned_loss=0.0505, over 21437.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2959, pruned_loss=0.07181, over 4268296.69 frames. ], batch size: 131, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:18:55,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1341972.0, ans=0.1 2023-06-25 13:19:58,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1342152.0, ans=0.0 2023-06-25 13:20:32,433 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.84 vs. limit=10.0 2023-06-25 13:20:34,612 INFO [train.py:996] (2/4) Epoch 8, batch 10250, loss[loss=0.2319, simple_loss=0.3093, pruned_loss=0.07726, over 21283.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2911, pruned_loss=0.06744, over 4264833.24 frames. 
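
The many ScheduledFloat entries (name=..., batch_count=..., ans=...) record the current value of a scheduled hyperparameter such as a dropout probability or a skip rate, looked up from the global batch count. The sketch below assumes piecewise-linear interpolation between (batch_count, value) breakpoints; the real ScheduledFloat class may use a different rule, so treat this only as an illustration of the logged relationship:

import bisect

class ScheduledValue:
    def __init__(self, *points):
        # points: (batch_count, value) pairs, sorted by batch_count
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def __call__(self, batch_count):
        i = bisect.bisect_right(self.xs, batch_count)
        if i == 0:
            return self.ys[0]
        if i == len(self.xs):
            return self.ys[-1]
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

conv_skip_rate = ScheduledValue((0.0, 0.2), (20000.0, 0.05), (50000.0, 0.0))
print(conv_skip_rate(1341432.0))  # 0.0: far past the last breakpoint, like the entries above
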
], batch size: 159, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:20:42,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1342272.0, ans=0.125 2023-06-25 13:21:41,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1342452.0, ans=0.2 2023-06-25 13:21:51,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1342452.0, ans=0.125 2023-06-25 13:22:02,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1342512.0, ans=0.1 2023-06-25 13:22:10,340 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0 2023-06-25 13:22:11,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1342512.0, ans=0.125 2023-06-25 13:22:23,170 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.001e+02 3.628e+02 5.027e+02 6.947e+02 1.354e+03, threshold=1.005e+03, percent-clipped=10.0 2023-06-25 13:22:26,723 INFO [train.py:996] (2/4) Epoch 8, batch 10300, loss[loss=0.2104, simple_loss=0.3006, pruned_loss=0.06007, over 21736.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2929, pruned_loss=0.06712, over 4266947.20 frames. ], batch size: 247, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:23:05,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1342632.0, ans=0.0 2023-06-25 13:24:02,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1342812.0, ans=0.05 2023-06-25 13:24:18,478 INFO [train.py:996] (2/4) Epoch 8, batch 10350, loss[loss=0.1598, simple_loss=0.2166, pruned_loss=0.0515, over 21163.00 frames. ], tot_loss[loss=0.215, simple_loss=0.295, pruned_loss=0.06755, over 4273642.13 frames. ], batch size: 143, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:24:31,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1342872.0, ans=0.0 2023-06-25 13:24:33,750 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.90 vs. limit=15.0 2023-06-25 13:24:34,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1342872.0, ans=0.125 2023-06-25 13:24:40,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1342932.0, ans=0.2 2023-06-25 13:25:19,948 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.63 vs. limit=15.0 2023-06-25 13:25:23,874 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.17 vs. 
limit=15.0 2023-06-25 13:26:05,203 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.961e+02 4.465e+02 6.325e+02 1.027e+03 2.051e+03, threshold=1.265e+03, percent-clipped=26.0 2023-06-25 13:26:15,289 INFO [train.py:996] (2/4) Epoch 8, batch 10400, loss[loss=0.2178, simple_loss=0.2965, pruned_loss=0.06954, over 21768.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2849, pruned_loss=0.06599, over 4258558.20 frames. ], batch size: 352, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:26:26,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1343172.0, ans=0.1 2023-06-25 13:26:48,602 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=15.0 2023-06-25 13:27:58,441 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=22.5 2023-06-25 13:28:06,151 INFO [train.py:996] (2/4) Epoch 8, batch 10450, loss[loss=0.2002, simple_loss=0.2855, pruned_loss=0.05746, over 21800.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2888, pruned_loss=0.06844, over 4259847.71 frames. ], batch size: 282, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:28:31,569 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.64 vs. limit=15.0 2023-06-25 13:28:38,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1343532.0, ans=0.125 2023-06-25 13:29:02,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1343592.0, ans=0.05 2023-06-25 13:29:09,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1343592.0, ans=0.125 2023-06-25 13:29:37,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1343712.0, ans=0.2 2023-06-25 13:29:52,720 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.857e+02 4.046e+02 6.081e+02 8.924e+02 1.860e+03, threshold=1.216e+03, percent-clipped=7.0 2023-06-25 13:29:54,319 INFO [train.py:996] (2/4) Epoch 8, batch 10500, loss[loss=0.2207, simple_loss=0.2818, pruned_loss=0.07983, over 21509.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2885, pruned_loss=0.06697, over 4259645.49 frames. ], batch size: 441, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:30:26,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1343832.0, ans=0.1 2023-06-25 13:30:49,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1343892.0, ans=0.125 2023-06-25 13:30:50,051 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=15.0 2023-06-25 13:31:42,074 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=12.0 2023-06-25 13:31:44,400 INFO [train.py:996] (2/4) Epoch 8, batch 10550, loss[loss=0.2026, simple_loss=0.2697, pruned_loss=0.06768, over 21812.00 frames. 
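
The Whitening messages (metric=M vs. limit=L, with num_groups presumably splitting the channels into groups for the same check) track how far a module's activations are from having a white channel covariance, and only matter once the metric exceeds the logged limit. One plausible metric with that behaviour is the eigenvalue spread E[lambda^2] / (E[lambda])^2 of the covariance, which is about 1.0 for an isotropic signal and grows as channels become correlated or unevenly scaled. The function below is an illustration of that idea, not the training code's exact definition:

import torch

def whitening_metric(x):
    # x: (frames, channels); returns ~1.0 for white features, larger otherwise
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]
    eigs = torch.linalg.eigvalsh(cov)
    return (eigs.pow(2).mean() / eigs.mean().pow(2)).item()

x = torch.randn(2000, 256)
print(whitening_metric(x))                                   # close to 1.0
print(whitening_metric(x * torch.linspace(0.1, 3.0, 256)))   # clearly above 1.0
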
], tot_loss[loss=0.2091, simple_loss=0.2839, pruned_loss=0.06711, over 4253185.50 frames. ], batch size: 352, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:32:22,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1344132.0, ans=0.2 2023-06-25 13:32:22,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1344132.0, ans=15.0 2023-06-25 13:32:37,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1344192.0, ans=0.125 2023-06-25 13:32:43,787 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.90 vs. limit=5.0 2023-06-25 13:33:35,048 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.430e+02 3.879e+02 5.008e+02 7.044e+02 1.478e+03, threshold=1.002e+03, percent-clipped=2.0 2023-06-25 13:33:36,511 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=15.0 2023-06-25 13:33:37,138 INFO [train.py:996] (2/4) Epoch 8, batch 10600, loss[loss=0.1814, simple_loss=0.2612, pruned_loss=0.05082, over 21188.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2807, pruned_loss=0.06571, over 4256994.26 frames. ], batch size: 548, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:34:03,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1344372.0, ans=0.0 2023-06-25 13:34:10,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1344432.0, ans=0.1 2023-06-25 13:34:16,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1344432.0, ans=0.0 2023-06-25 13:34:17,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1344432.0, ans=0.1 2023-06-25 13:34:23,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1344432.0, ans=0.0 2023-06-25 13:35:34,394 INFO [train.py:996] (2/4) Epoch 8, batch 10650, loss[loss=0.1408, simple_loss=0.2066, pruned_loss=0.03753, over 21781.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2816, pruned_loss=0.06456, over 4270226.65 frames. 
], batch size: 124, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:35:54,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1344672.0, ans=0.2 2023-06-25 13:36:08,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1344732.0, ans=0.125 2023-06-25 13:36:23,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1344792.0, ans=0.95 2023-06-25 13:36:42,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1344852.0, ans=0.0 2023-06-25 13:36:52,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1344852.0, ans=0.1 2023-06-25 13:37:23,246 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.596e+02 3.773e+02 5.055e+02 6.605e+02 1.042e+03, threshold=1.011e+03, percent-clipped=1.0 2023-06-25 13:37:30,131 INFO [train.py:996] (2/4) Epoch 8, batch 10700, loss[loss=0.2646, simple_loss=0.3427, pruned_loss=0.09326, over 21396.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2808, pruned_loss=0.06507, over 4273892.16 frames. ], batch size: 131, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:37:32,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1344972.0, ans=0.1 2023-06-25 13:37:34,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1344972.0, ans=0.0 2023-06-25 13:37:51,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1345032.0, ans=0.1 2023-06-25 13:39:07,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1345212.0, ans=0.125 2023-06-25 13:39:22,495 INFO [train.py:996] (2/4) Epoch 8, batch 10750, loss[loss=0.2398, simple_loss=0.3278, pruned_loss=0.07591, over 21751.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.293, pruned_loss=0.0696, over 4277036.07 frames. ], batch size: 247, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:39:37,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1345272.0, ans=0.0 2023-06-25 13:40:11,820 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.76 vs. limit=10.0 2023-06-25 13:40:21,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1345452.0, ans=0.0 2023-06-25 13:41:03,865 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=22.5 2023-06-25 13:41:04,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1345512.0, ans=0.0 2023-06-25 13:41:05,931 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.724e+02 3.874e+02 4.652e+02 6.783e+02 1.933e+03, threshold=9.304e+02, percent-clipped=9.0 2023-06-25 13:41:08,326 INFO [train.py:996] (2/4) Epoch 8, batch 10800, loss[loss=0.2448, simple_loss=0.3183, pruned_loss=0.08568, over 21533.00 frames. 
], tot_loss[loss=0.2183, simple_loss=0.2973, pruned_loss=0.06968, over 4276477.47 frames. ], batch size: 230, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:41:08,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1345572.0, ans=0.125 2023-06-25 13:41:28,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1345632.0, ans=0.125 2023-06-25 13:41:39,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1345632.0, ans=0.0 2023-06-25 13:41:48,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1345692.0, ans=0.125 2023-06-25 13:41:50,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1345692.0, ans=0.0 2023-06-25 13:41:56,201 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.90 vs. limit=5.0 2023-06-25 13:42:08,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1345692.0, ans=0.0 2023-06-25 13:42:10,541 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-25 13:42:11,774 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:42:11,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1345752.0, ans=0.0 2023-06-25 13:42:52,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1345872.0, ans=0.125 2023-06-25 13:42:53,467 INFO [train.py:996] (2/4) Epoch 8, batch 10850, loss[loss=0.2183, simple_loss=0.3179, pruned_loss=0.05939, over 20776.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2984, pruned_loss=0.07012, over 4278163.18 frames. ], batch size: 609, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:43:01,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1345872.0, ans=0.125 2023-06-25 13:44:43,293 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 4.154e+02 5.827e+02 8.227e+02 1.341e+03, threshold=1.165e+03, percent-clipped=17.0 2023-06-25 13:44:43,325 INFO [train.py:996] (2/4) Epoch 8, batch 10900, loss[loss=0.231, simple_loss=0.3308, pruned_loss=0.06558, over 21599.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2919, pruned_loss=0.06888, over 4267201.82 frames. ], batch size: 441, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:46:20,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1346412.0, ans=0.1 2023-06-25 13:46:28,385 INFO [train.py:996] (2/4) Epoch 8, batch 10950, loss[loss=0.1897, simple_loss=0.2655, pruned_loss=0.05695, over 21370.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2876, pruned_loss=0.06713, over 4252329.87 frames. 
], batch size: 131, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 13:46:32,701 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:47:17,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1346592.0, ans=0.1 2023-06-25 13:47:39,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1346652.0, ans=0.1 2023-06-25 13:48:07,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1346712.0, ans=0.125 2023-06-25 13:48:10,264 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.690e+02 3.805e+02 5.172e+02 7.672e+02 1.562e+03, threshold=1.034e+03, percent-clipped=4.0 2023-06-25 13:48:10,311 INFO [train.py:996] (2/4) Epoch 8, batch 11000, loss[loss=0.2105, simple_loss=0.2627, pruned_loss=0.07918, over 20155.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2853, pruned_loss=0.06773, over 4251846.26 frames. ], batch size: 702, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 13:48:30,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1346772.0, ans=0.125 2023-06-25 13:48:54,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1346892.0, ans=0.2 2023-06-25 13:49:15,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1346892.0, ans=0.125 2023-06-25 13:49:26,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1346952.0, ans=0.0 2023-06-25 13:49:28,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1346952.0, ans=0.025 2023-06-25 13:49:32,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1346952.0, ans=0.125 2023-06-25 13:49:52,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1347012.0, ans=0.1 2023-06-25 13:49:58,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1347072.0, ans=0.035 2023-06-25 13:49:59,327 INFO [train.py:996] (2/4) Epoch 8, batch 11050, loss[loss=0.2066, simple_loss=0.2748, pruned_loss=0.06919, over 21802.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2833, pruned_loss=0.06899, over 4253902.33 frames. 
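
The learning rate attached to each batch drifts down very slowly across this excerpt (3.78e-03 near the top, 3.76e-03 by this point). A schedule of the following general shape would behave that way: a factor that decays as an inverse power of both the global batch index and the epoch, applied to a base learning rate. This is an assumed form with placeholder constants, not the verified schedule of this run:

def lr_factor(batch_idx, epoch, lr_batches=5000.0, lr_epochs=3.0):
    # decays smoothly in both the batch index and the epoch; constants are placeholders
    batch_term = ((batch_idx ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_term = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return batch_term * epoch_term

base_lr = 0.05  # placeholder
for batch_idx in (50_000, 60_000, 70_000):
    print(base_lr * lr_factor(batch_idx, epoch=8))  # drifts down slowly as batches accumulate
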
], batch size: 112, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 13:50:01,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1347072.0, ans=0.125 2023-06-25 13:50:06,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1347072.0, ans=0.04949747468305833 2023-06-25 13:51:26,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1347252.0, ans=0.2 2023-06-25 13:51:31,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1347312.0, ans=0.125 2023-06-25 13:51:49,994 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.876e+02 3.834e+02 4.608e+02 6.864e+02 1.206e+03, threshold=9.217e+02, percent-clipped=3.0 2023-06-25 13:51:50,040 INFO [train.py:996] (2/4) Epoch 8, batch 11100, loss[loss=0.2032, simple_loss=0.278, pruned_loss=0.06424, over 21663.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2838, pruned_loss=0.06954, over 4239970.01 frames. ], batch size: 282, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 13:51:55,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1347372.0, ans=0.0 2023-06-25 13:53:08,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1347552.0, ans=0.125 2023-06-25 13:53:22,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1347612.0, ans=0.0 2023-06-25 13:53:24,320 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:53:39,322 INFO [train.py:996] (2/4) Epoch 8, batch 11150, loss[loss=0.2459, simple_loss=0.3328, pruned_loss=0.07948, over 21603.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.282, pruned_loss=0.06963, over 4242557.42 frames. ], batch size: 441, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 13:53:40,500 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=12.0 2023-06-25 13:54:37,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1347792.0, ans=0.0 2023-06-25 13:55:11,807 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.85 vs. limit=15.0 2023-06-25 13:55:22,812 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.920e+02 3.496e+02 4.428e+02 6.433e+02 1.139e+03, threshold=8.857e+02, percent-clipped=2.0 2023-06-25 13:55:22,859 INFO [train.py:996] (2/4) Epoch 8, batch 11200, loss[loss=0.1882, simple_loss=0.2534, pruned_loss=0.06151, over 21377.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2802, pruned_loss=0.06948, over 4240054.52 frames. 
], batch size: 131, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 13:55:46,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1348032.0, ans=0.0 2023-06-25 13:56:16,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1348092.0, ans=0.1 2023-06-25 13:56:26,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1348092.0, ans=0.125 2023-06-25 13:56:49,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1348152.0, ans=0.125 2023-06-25 13:56:52,134 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-25 13:57:05,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1348212.0, ans=0.125 2023-06-25 13:57:10,459 INFO [train.py:996] (2/4) Epoch 8, batch 11250, loss[loss=0.2161, simple_loss=0.305, pruned_loss=0.0636, over 21588.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2794, pruned_loss=0.06881, over 4251871.12 frames. ], batch size: 414, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 13:57:19,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1348272.0, ans=0.0 2023-06-25 13:57:46,232 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:58:03,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1348392.0, ans=0.1 2023-06-25 13:58:59,633 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.719e+02 3.512e+02 4.278e+02 5.867e+02 1.075e+03, threshold=8.556e+02, percent-clipped=3.0 2023-06-25 13:58:59,665 INFO [train.py:996] (2/4) Epoch 8, batch 11300, loss[loss=0.1892, simple_loss=0.2709, pruned_loss=0.05376, over 21866.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2812, pruned_loss=0.06867, over 4255259.68 frames. ], batch size: 316, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 13:59:27,126 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=22.5 2023-06-25 13:59:56,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1348692.0, ans=0.1 2023-06-25 14:00:49,636 INFO [train.py:996] (2/4) Epoch 8, batch 11350, loss[loss=0.2123, simple_loss=0.3012, pruned_loss=0.06167, over 21711.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2829, pruned_loss=0.06868, over 4254090.48 frames. ], batch size: 263, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:01:04,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1348872.0, ans=0.0 2023-06-25 14:01:35,691 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.17 vs. 
limit=15.0 2023-06-25 14:01:36,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1348992.0, ans=0.1 2023-06-25 14:02:41,712 INFO [train.py:996] (2/4) Epoch 8, batch 11400, loss[loss=0.2059, simple_loss=0.2964, pruned_loss=0.05766, over 21714.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2887, pruned_loss=0.07012, over 4252055.87 frames. ], batch size: 298, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:02:43,600 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.820e+02 3.968e+02 4.967e+02 6.707e+02 2.156e+03, threshold=9.935e+02, percent-clipped=13.0 2023-06-25 14:03:12,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1349232.0, ans=0.0 2023-06-25 14:03:47,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1349292.0, ans=0.2 2023-06-25 14:03:51,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1349292.0, ans=0.0 2023-06-25 14:04:36,644 INFO [train.py:996] (2/4) Epoch 8, batch 11450, loss[loss=0.3008, simple_loss=0.36, pruned_loss=0.1208, over 21384.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2904, pruned_loss=0.06977, over 4251665.08 frames. ], batch size: 508, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:05:14,231 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.55 vs. limit=15.0 2023-06-25 14:05:38,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1349592.0, ans=0.0 2023-06-25 14:06:33,218 INFO [train.py:996] (2/4) Epoch 8, batch 11500, loss[loss=0.2256, simple_loss=0.304, pruned_loss=0.07365, over 21192.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2939, pruned_loss=0.07104, over 4254807.20 frames. ], batch size: 143, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:06:34,726 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.573e+02 4.073e+02 4.904e+02 7.356e+02 1.531e+03, threshold=9.808e+02, percent-clipped=13.0 2023-06-25 14:06:55,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1349832.0, ans=0.04949747468305833 2023-06-25 14:07:09,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1349832.0, ans=0.125 2023-06-25 14:07:20,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1349892.0, ans=0.125 2023-06-25 14:07:35,744 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=15.0 2023-06-25 14:07:35,925 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.50 vs. limit=15.0 2023-06-25 14:08:20,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1350012.0, ans=0.0 2023-06-25 14:08:29,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1350072.0, ans=0.0 2023-06-25 14:08:31,020 INFO [train.py:996] (2/4) Epoch 8, batch 11550, loss[loss=0.1927, simple_loss=0.2544, pruned_loss=0.0655, over 20788.00 frames. 
], tot_loss[loss=0.2206, simple_loss=0.2992, pruned_loss=0.07098, over 4252985.97 frames. ], batch size: 608, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:08:35,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1350072.0, ans=0.125 2023-06-25 14:09:20,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1350192.0, ans=0.0 2023-06-25 14:10:21,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1350372.0, ans=0.2 2023-06-25 14:10:22,650 INFO [train.py:996] (2/4) Epoch 8, batch 11600, loss[loss=0.2538, simple_loss=0.3365, pruned_loss=0.08555, over 21343.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3155, pruned_loss=0.07363, over 4258680.49 frames. ], batch size: 159, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 14:10:24,363 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.914e+02 4.338e+02 5.534e+02 7.509e+02 2.145e+03, threshold=1.107e+03, percent-clipped=20.0 2023-06-25 14:11:09,863 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.78 vs. limit=6.0 2023-06-25 14:11:16,649 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:11:25,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1350552.0, ans=0.07 2023-06-25 14:11:40,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1350552.0, ans=0.1 2023-06-25 14:12:12,227 INFO [train.py:996] (2/4) Epoch 8, batch 11650, loss[loss=0.2322, simple_loss=0.3114, pruned_loss=0.07648, over 21734.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3228, pruned_loss=0.07407, over 4261784.74 frames. ], batch size: 351, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 14:12:45,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1350732.0, ans=0.04949747468305833 2023-06-25 14:12:50,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1350792.0, ans=0.0 2023-06-25 14:13:51,479 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-25 14:13:55,112 INFO [train.py:996] (2/4) Epoch 8, batch 11700, loss[loss=0.1922, simple_loss=0.2586, pruned_loss=0.06295, over 21572.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.313, pruned_loss=0.07341, over 4263374.56 frames. 
], batch size: 263, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:13:55,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1350972.0, ans=0.95 2023-06-25 14:13:58,338 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.769e+02 3.697e+02 5.318e+02 8.205e+02 1.649e+03, threshold=1.064e+03, percent-clipped=10.0 2023-06-25 14:14:22,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1351032.0, ans=0.025 2023-06-25 14:15:02,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1351152.0, ans=0.2 2023-06-25 14:15:04,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1351152.0, ans=0.125 2023-06-25 14:15:43,642 INFO [train.py:996] (2/4) Epoch 8, batch 11750, loss[loss=0.2498, simple_loss=0.2843, pruned_loss=0.1076, over 21524.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3033, pruned_loss=0.07317, over 4263792.23 frames. ], batch size: 512, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:16:20,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1351332.0, ans=0.125 2023-06-25 14:16:21,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1351332.0, ans=0.125 2023-06-25 14:16:56,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1351452.0, ans=0.2 2023-06-25 14:17:40,849 INFO [train.py:996] (2/4) Epoch 8, batch 11800, loss[loss=0.2227, simple_loss=0.2935, pruned_loss=0.07594, over 21823.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3049, pruned_loss=0.07462, over 4255473.17 frames. ], batch size: 247, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:17:44,257 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.704e+02 3.917e+02 5.538e+02 7.967e+02 1.804e+03, threshold=1.108e+03, percent-clipped=14.0 2023-06-25 14:17:50,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1351572.0, ans=0.125 2023-06-25 14:17:55,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1351572.0, ans=0.0 2023-06-25 14:18:01,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1351632.0, ans=0.125 2023-06-25 14:18:56,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=1351752.0, ans=0.1 2023-06-25 14:19:07,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1351812.0, ans=0.2 2023-06-25 14:19:30,354 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=22.5 2023-06-25 14:19:30,729 INFO [train.py:996] (2/4) Epoch 8, batch 11850, loss[loss=0.2201, simple_loss=0.3185, pruned_loss=0.06088, over 21776.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3062, pruned_loss=0.07383, over 4265422.57 frames. 
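
The batch sizes in these lines swing from roughly 100 to roughly 700 cuts while the per-batch frame counts stay in a similar range, which is the signature of duration-based batching: utterances are packed into a batch until a duration budget is reached, so batches of short cuts hold many utterances and batches of long cuts hold few. A toy sketch of that packing rule follows; the budget is a placeholder and the sampler actually used by the data module is more sophisticated:

def pack_by_duration(cuts, max_duration_s=600.0):
    # cuts: iterable of (cut_id, duration_in_seconds)
    batch, batch_dur = [], 0.0
    for cut_id, dur in cuts:
        if batch and batch_dur + dur > max_duration_s:
            yield batch
            batch, batch_dur = [], 0.0
        batch.append(cut_id)
        batch_dur += dur
    if batch:
        yield batch

short = [(f"short{i}", 1.5) for i in range(2000)]  # many short cuts -> large batches
long_ = [(f"long{i}", 12.0) for i in range(200)]   # long cuts -> small batches
print(len(next(pack_by_duration(short))))  # 400 cuts
print(len(next(pack_by_duration(long_))))  # 50 cuts
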
], batch size: 351, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:19:34,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1351872.0, ans=0.1 2023-06-25 14:20:11,744 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=12.0 2023-06-25 14:20:22,259 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.12 vs. limit=10.0 2023-06-25 14:20:41,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1352052.0, ans=0.0 2023-06-25 14:21:02,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1352112.0, ans=0.125 2023-06-25 14:21:02,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1352112.0, ans=0.2 2023-06-25 14:21:22,243 INFO [train.py:996] (2/4) Epoch 8, batch 11900, loss[loss=0.2564, simple_loss=0.3339, pruned_loss=0.0894, over 21396.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3057, pruned_loss=0.0713, over 4267914.89 frames. ], batch size: 507, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:21:25,802 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.770e+02 3.589e+02 4.714e+02 6.474e+02 1.333e+03, threshold=9.428e+02, percent-clipped=3.0 2023-06-25 14:22:07,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1352232.0, ans=0.125 2023-06-25 14:22:36,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1352352.0, ans=0.125 2023-06-25 14:22:40,722 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-25 14:22:42,801 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0 2023-06-25 14:22:44,127 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-25 14:22:54,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1352352.0, ans=0.125 2023-06-25 14:23:07,607 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-25 14:23:16,551 INFO [train.py:996] (2/4) Epoch 8, batch 11950, loss[loss=0.2212, simple_loss=0.3275, pruned_loss=0.05743, over 21690.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3066, pruned_loss=0.06854, over 4265693.25 frames. ], batch size: 247, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:23:43,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1352472.0, ans=0.1 2023-06-25 14:24:03,568 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.62 vs. 
limit=15.0 2023-06-25 14:24:35,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1352652.0, ans=0.125 2023-06-25 14:24:41,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1352652.0, ans=0.125 2023-06-25 14:25:06,502 INFO [train.py:996] (2/4) Epoch 8, batch 12000, loss[loss=0.1887, simple_loss=0.259, pruned_loss=0.05921, over 21681.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2999, pruned_loss=0.06696, over 4266317.74 frames. ], batch size: 282, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 14:25:06,502 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 14:25:31,288 INFO [train.py:1028] (2/4) Epoch 8, validation: loss=0.2626, simple_loss=0.3537, pruned_loss=0.08577, over 1796401.00 frames. 2023-06-25 14:25:31,289 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-25 14:25:34,826 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.581e+02 4.444e+02 6.606e+02 1.302e+03, threshold=8.887e+02, percent-clipped=8.0 2023-06-25 14:25:51,544 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.40 vs. limit=15.0 2023-06-25 14:26:18,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1352892.0, ans=0.0 2023-06-25 14:26:25,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1352892.0, ans=0.2 2023-06-25 14:26:49,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1353012.0, ans=0.0 2023-06-25 14:27:08,711 INFO [train.py:996] (2/4) Epoch 8, batch 12050, loss[loss=0.211, simple_loss=0.2838, pruned_loss=0.06909, over 21675.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2958, pruned_loss=0.06884, over 4267284.58 frames. ], batch size: 230, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 14:27:25,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1353072.0, ans=0.125 2023-06-25 14:27:48,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1353132.0, ans=0.2 2023-06-25 14:28:01,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1353192.0, ans=0.125 2023-06-25 14:28:17,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1353192.0, ans=0.125 2023-06-25 14:28:34,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1353252.0, ans=0.125 2023-06-25 14:29:10,846 INFO [train.py:996] (2/4) Epoch 8, batch 12100, loss[loss=0.2786, simple_loss=0.341, pruned_loss=0.1081, over 21310.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3032, pruned_loss=0.07248, over 4273581.63 frames. 
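
At batch 12000 above the loop pauses to compute the validation loss, reports the dev-set losses over a fixed frame count, and prints the allocator high-water mark ("Maximum memory allocated so far is 23731MB"). A sketch of that bookkeeping, assuming a plain no-grad pass over the dev loader and frame-weighted averaging; every name here is a placeholder:

import torch

def run_validation(model, dev_loader, compute_loss, device):
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in dev_loader:
            loss, frames = compute_loss(model, batch)  # per-batch loss and frame count
            tot_loss += float(loss) * frames
            tot_frames += frames
    model.train()
    max_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    return tot_loss / tot_frames, max_mb
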
], batch size: 176, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 14:29:14,314 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.733e+02 4.401e+02 6.036e+02 8.453e+02 2.254e+03, threshold=1.207e+03, percent-clipped=23.0 2023-06-25 14:30:12,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1353492.0, ans=15.0 2023-06-25 14:30:22,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1353552.0, ans=0.0 2023-06-25 14:30:37,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1353552.0, ans=0.125 2023-06-25 14:30:48,444 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. limit=6.0 2023-06-25 14:31:09,975 INFO [train.py:996] (2/4) Epoch 8, batch 12150, loss[loss=0.2245, simple_loss=0.3187, pruned_loss=0.06518, over 21694.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3055, pruned_loss=0.07223, over 4265732.83 frames. ], batch size: 298, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 14:31:48,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1353732.0, ans=0.0 2023-06-25 14:32:19,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1353852.0, ans=0.125 2023-06-25 14:32:59,809 INFO [train.py:996] (2/4) Epoch 8, batch 12200, loss[loss=0.1841, simple_loss=0.2481, pruned_loss=0.05999, over 21589.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.3001, pruned_loss=0.07106, over 4266231.43 frames. ], batch size: 231, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 14:33:03,415 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.861e+02 3.926e+02 5.745e+02 7.853e+02 1.417e+03, threshold=1.149e+03, percent-clipped=2.0 2023-06-25 14:34:07,346 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=15.0 2023-06-25 14:34:47,638 INFO [train.py:996] (2/4) Epoch 8, batch 12250, loss[loss=0.1782, simple_loss=0.2681, pruned_loss=0.04415, over 21705.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2926, pruned_loss=0.06796, over 4264383.26 frames. 
], batch size: 415, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:34:55,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1354272.0, ans=0.125 2023-06-25 14:35:14,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1354332.0, ans=0.125 2023-06-25 14:35:28,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1354392.0, ans=0.0 2023-06-25 14:35:41,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1354392.0, ans=0.1 2023-06-25 14:35:48,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1354452.0, ans=0.125 2023-06-25 14:35:58,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1354452.0, ans=0.125 2023-06-25 14:36:26,952 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.73 vs. limit=10.0 2023-06-25 14:36:36,604 INFO [train.py:996] (2/4) Epoch 8, batch 12300, loss[loss=0.255, simple_loss=0.3441, pruned_loss=0.08294, over 21660.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2867, pruned_loss=0.06359, over 4264802.00 frames. ], batch size: 441, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:36:41,788 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 3.509e+02 4.835e+02 7.096e+02 1.534e+03, threshold=9.669e+02, percent-clipped=2.0 2023-06-25 14:36:50,153 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=15.0 2023-06-25 14:37:05,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1354632.0, ans=0.0 2023-06-25 14:37:23,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1354692.0, ans=0.0 2023-06-25 14:38:25,422 INFO [train.py:996] (2/4) Epoch 8, batch 12350, loss[loss=0.196, simple_loss=0.3262, pruned_loss=0.03284, over 20782.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2908, pruned_loss=0.06359, over 4266350.07 frames. ], batch size: 607, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:40:12,727 INFO [train.py:996] (2/4) Epoch 8, batch 12400, loss[loss=0.2163, simple_loss=0.2825, pruned_loss=0.07504, over 21850.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2929, pruned_loss=0.06705, over 4280747.36 frames. 
], batch size: 298, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 14:40:13,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1355172.0, ans=0.125 2023-06-25 14:40:17,823 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.711e+02 4.388e+02 6.020e+02 7.604e+02 1.312e+03, threshold=1.204e+03, percent-clipped=10.0 2023-06-25 14:40:18,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1355172.0, ans=0.125 2023-06-25 14:40:20,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1355172.0, ans=0.1 2023-06-25 14:40:36,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1355232.0, ans=0.2 2023-06-25 14:40:53,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1355292.0, ans=0.1 2023-06-25 14:41:53,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1355412.0, ans=0.125 2023-06-25 14:42:03,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1355472.0, ans=0.0 2023-06-25 14:42:04,072 INFO [train.py:996] (2/4) Epoch 8, batch 12450, loss[loss=0.2407, simple_loss=0.3229, pruned_loss=0.07924, over 21919.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2972, pruned_loss=0.07002, over 4284643.58 frames. ], batch size: 316, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:42:33,361 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.69 vs. limit=12.0 2023-06-25 14:42:37,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1355532.0, ans=0.1 2023-06-25 14:43:01,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1355592.0, ans=0.125 2023-06-25 14:43:18,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1355652.0, ans=0.125 2023-06-25 14:43:46,811 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.84 vs. limit=22.5 2023-06-25 14:43:55,912 INFO [train.py:996] (2/4) Epoch 8, batch 12500, loss[loss=0.2505, simple_loss=0.3417, pruned_loss=0.07966, over 21316.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.308, pruned_loss=0.07381, over 4285219.41 frames. ], batch size: 176, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:44:02,978 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.309e+02 4.292e+02 5.906e+02 9.269e+02 3.047e+03, threshold=1.181e+03, percent-clipped=14.0 2023-06-25 14:45:26,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1355952.0, ans=0.125 2023-06-25 14:45:47,022 INFO [train.py:996] (2/4) Epoch 8, batch 12550, loss[loss=0.2313, simple_loss=0.3117, pruned_loss=0.07541, over 21407.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3116, pruned_loss=0.07537, over 4285033.98 frames. 
], batch size: 211, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:45:53,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1356072.0, ans=0.0 2023-06-25 14:46:04,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1356072.0, ans=0.125 2023-06-25 14:46:50,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1356192.0, ans=0.0 2023-06-25 14:47:42,217 INFO [train.py:996] (2/4) Epoch 8, batch 12600, loss[loss=0.2083, simple_loss=0.3038, pruned_loss=0.0564, over 21578.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3105, pruned_loss=0.07281, over 4280261.06 frames. ], batch size: 389, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:47:48,696 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.799e+02 4.195e+02 5.786e+02 8.769e+02 1.751e+03, threshold=1.157e+03, percent-clipped=8.0 2023-06-25 14:48:24,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1356432.0, ans=0.0 2023-06-25 14:49:11,057 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.15 vs. limit=12.0 2023-06-25 14:49:23,545 INFO [train.py:996] (2/4) Epoch 8, batch 12650, loss[loss=0.2165, simple_loss=0.2904, pruned_loss=0.07137, over 21711.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3029, pruned_loss=0.06907, over 4272190.96 frames. ], batch size: 389, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:50:41,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1356852.0, ans=0.2 2023-06-25 14:50:55,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1356912.0, ans=0.5 2023-06-25 14:51:19,789 INFO [train.py:996] (2/4) Epoch 8, batch 12700, loss[loss=0.2311, simple_loss=0.3042, pruned_loss=0.07905, over 21498.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3026, pruned_loss=0.07203, over 4275278.15 frames. ], batch size: 211, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:51:32,488 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.627e+02 4.265e+02 5.595e+02 7.381e+02 1.572e+03, threshold=1.119e+03, percent-clipped=3.0 2023-06-25 14:51:39,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1356972.0, ans=0.1 2023-06-25 14:51:46,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1357032.0, ans=0.125 2023-06-25 14:52:01,273 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.39 vs. limit=15.0 2023-06-25 14:52:03,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1357092.0, ans=0.125 2023-06-25 14:52:23,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1357152.0, ans=0.04949747468305833 2023-06-25 14:52:34,574 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.81 vs. 
limit=12.0 2023-06-25 14:52:34,600 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=22.5 2023-06-25 14:53:02,647 INFO [train.py:996] (2/4) Epoch 8, batch 12750, loss[loss=0.2256, simple_loss=0.3013, pruned_loss=0.07498, over 21774.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3043, pruned_loss=0.07225, over 4272581.48 frames. ], batch size: 112, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:53:24,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1357332.0, ans=0.125 2023-06-25 14:53:34,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1357332.0, ans=10.0 2023-06-25 14:54:18,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1357452.0, ans=0.0 2023-06-25 14:54:25,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1357512.0, ans=0.0 2023-06-25 14:54:36,696 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:54:37,232 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.76 vs. limit=10.0 2023-06-25 14:54:55,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1357572.0, ans=0.0 2023-06-25 14:54:57,139 INFO [train.py:996] (2/4) Epoch 8, batch 12800, loss[loss=0.2215, simple_loss=0.2959, pruned_loss=0.07357, over 21932.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3039, pruned_loss=0.07246, over 4269801.73 frames. ], batch size: 316, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 14:55:04,033 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.881e+02 3.698e+02 4.519e+02 5.409e+02 8.581e+02, threshold=9.039e+02, percent-clipped=0.0 2023-06-25 14:55:11,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1357572.0, ans=0.125 2023-06-25 14:55:28,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1357632.0, ans=0.0 2023-06-25 14:55:44,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1357692.0, ans=0.2 2023-06-25 14:56:38,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1357812.0, ans=0.0 2023-06-25 14:56:47,894 INFO [train.py:996] (2/4) Epoch 8, batch 12850, loss[loss=0.1921, simple_loss=0.2967, pruned_loss=0.04373, over 21770.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3054, pruned_loss=0.07387, over 4275733.85 frames. ], batch size: 332, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 14:57:07,489 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.82 vs. limit=6.0 2023-06-25 14:57:09,354 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.16 vs. 
limit=10.0 2023-06-25 14:57:37,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1357992.0, ans=0.1 2023-06-25 14:58:40,146 INFO [train.py:996] (2/4) Epoch 8, batch 12900, loss[loss=0.178, simple_loss=0.2573, pruned_loss=0.0494, over 21144.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.3023, pruned_loss=0.07069, over 4275181.43 frames. ], batch size: 176, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 14:58:47,393 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.654e+02 3.588e+02 4.373e+02 7.155e+02 1.857e+03, threshold=8.745e+02, percent-clipped=14.0 2023-06-25 14:59:11,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1358232.0, ans=0.125 2023-06-25 14:59:11,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1358232.0, ans=0.125 2023-06-25 14:59:27,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1358292.0, ans=0.1 2023-06-25 14:59:43,450 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.07 vs. limit=12.0 2023-06-25 15:00:16,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1358412.0, ans=0.025 2023-06-25 15:00:24,727 INFO [train.py:996] (2/4) Epoch 8, batch 12950, loss[loss=0.1789, simple_loss=0.2616, pruned_loss=0.04812, over 21649.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.303, pruned_loss=0.06925, over 4268603.29 frames. ], batch size: 263, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:01:42,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1358652.0, ans=0.09899494936611666 2023-06-25 15:01:52,128 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.33 vs. limit=22.5 2023-06-25 15:02:14,920 INFO [train.py:996] (2/4) Epoch 8, batch 13000, loss[loss=0.2614, simple_loss=0.3639, pruned_loss=0.07945, over 19794.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3049, pruned_loss=0.06999, over 4266084.42 frames. ], batch size: 702, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:02:23,110 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.358e+02 3.843e+02 4.886e+02 6.754e+02 1.173e+03, threshold=9.772e+02, percent-clipped=9.0 2023-06-25 15:02:27,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1358772.0, ans=0.125 2023-06-25 15:02:38,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1358832.0, ans=0.125 2023-06-25 15:02:39,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1358832.0, ans=0.125 2023-06-25 15:02:44,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1358832.0, ans=0.0 2023-06-25 15:03:13,123 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.31 vs. 
limit=15.0 2023-06-25 15:03:18,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1358952.0, ans=0.125 2023-06-25 15:03:47,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1359012.0, ans=0.125 2023-06-25 15:03:57,489 INFO [train.py:996] (2/4) Epoch 8, batch 13050, loss[loss=0.179, simple_loss=0.2213, pruned_loss=0.06837, over 19960.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.3007, pruned_loss=0.0685, over 4261702.19 frames. ], batch size: 704, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:04:06,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1359072.0, ans=0.125 2023-06-25 15:04:24,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1359132.0, ans=0.015 2023-06-25 15:04:48,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1359192.0, ans=0.0 2023-06-25 15:05:08,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1359252.0, ans=0.0 2023-06-25 15:05:41,282 INFO [train.py:996] (2/4) Epoch 8, batch 13100, loss[loss=0.1994, simple_loss=0.2943, pruned_loss=0.05226, over 21703.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.3002, pruned_loss=0.06818, over 4270884.32 frames. ], batch size: 298, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:05:47,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1359372.0, ans=0.125 2023-06-25 15:05:50,178 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.856e+02 3.427e+02 4.465e+02 6.179e+02 1.477e+03, threshold=8.931e+02, percent-clipped=2.0 2023-06-25 15:06:27,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1359432.0, ans=0.0 2023-06-25 15:07:19,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1359612.0, ans=0.04949747468305833 2023-06-25 15:07:31,676 INFO [train.py:996] (2/4) Epoch 8, batch 13150, loss[loss=0.2368, simple_loss=0.3185, pruned_loss=0.07753, over 21437.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.302, pruned_loss=0.07035, over 4275668.19 frames. ], batch size: 211, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:07:45,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1359672.0, ans=0.0 2023-06-25 15:07:45,601 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=22.5 2023-06-25 15:07:55,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1359672.0, ans=0.0 2023-06-25 15:09:25,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1359972.0, ans=0.125 2023-06-25 15:09:27,382 INFO [train.py:996] (2/4) Epoch 8, batch 13200, loss[loss=0.2111, simple_loss=0.2768, pruned_loss=0.07266, over 21224.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.3008, pruned_loss=0.07022, over 4270802.76 frames. 
], batch size: 608, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 15:09:46,143 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.744e+02 3.706e+02 4.388e+02 6.661e+02 1.084e+03, threshold=8.775e+02, percent-clipped=9.0 2023-06-25 15:11:21,331 INFO [train.py:996] (2/4) Epoch 8, batch 13250, loss[loss=0.2169, simple_loss=0.2874, pruned_loss=0.07318, over 21657.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3002, pruned_loss=0.07193, over 4269129.07 frames. ], batch size: 230, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:11:48,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1360332.0, ans=0.125 2023-06-25 15:12:14,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1360392.0, ans=0.05 2023-06-25 15:13:18,500 INFO [train.py:996] (2/4) Epoch 8, batch 13300, loss[loss=0.2101, simple_loss=0.3248, pruned_loss=0.04766, over 19794.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3033, pruned_loss=0.0713, over 4268363.74 frames. ], batch size: 702, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:13:19,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1360572.0, ans=0.125 2023-06-25 15:13:34,397 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.430e+02 3.717e+02 5.105e+02 6.593e+02 1.654e+03, threshold=1.021e+03, percent-clipped=11.0 2023-06-25 15:14:15,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1360692.0, ans=0.05 2023-06-25 15:14:26,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1360752.0, ans=0.0 2023-06-25 15:15:08,177 INFO [train.py:996] (2/4) Epoch 8, batch 13350, loss[loss=0.2382, simple_loss=0.3216, pruned_loss=0.07737, over 21766.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3065, pruned_loss=0.07383, over 4271152.18 frames. ], batch size: 298, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:15:09,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1360872.0, ans=0.0 2023-06-25 15:15:33,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1360932.0, ans=0.125 2023-06-25 15:16:44,395 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:16:46,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1361112.0, ans=0.125 2023-06-25 15:16:50,294 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=15.0 2023-06-25 15:17:03,314 INFO [train.py:996] (2/4) Epoch 8, batch 13400, loss[loss=0.2332, simple_loss=0.3082, pruned_loss=0.07912, over 21460.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3072, pruned_loss=0.07574, over 4274565.23 frames. 
], batch size: 194, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:17:13,988 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.122e+02 3.939e+02 4.986e+02 7.057e+02 1.760e+03, threshold=9.973e+02, percent-clipped=5.0 2023-06-25 15:17:19,073 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=22.5 2023-06-25 15:17:20,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1361232.0, ans=0.2 2023-06-25 15:17:38,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1361292.0, ans=0.1 2023-06-25 15:18:41,747 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0 2023-06-25 15:18:52,667 INFO [train.py:996] (2/4) Epoch 8, batch 13450, loss[loss=0.1788, simple_loss=0.2445, pruned_loss=0.05656, over 21525.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3074, pruned_loss=0.07717, over 4277580.70 frames. ], batch size: 195, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:19:10,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1361532.0, ans=0.2 2023-06-25 15:19:12,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1361532.0, ans=0.125 2023-06-25 15:19:41,664 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.13 vs. limit=10.0 2023-06-25 15:20:01,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1361652.0, ans=0.125 2023-06-25 15:20:03,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1361652.0, ans=0.0 2023-06-25 15:20:41,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1361772.0, ans=0.025 2023-06-25 15:20:42,810 INFO [train.py:996] (2/4) Epoch 8, batch 13500, loss[loss=0.2377, simple_loss=0.3193, pruned_loss=0.07808, over 21284.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2985, pruned_loss=0.0743, over 4268237.50 frames. 
], batch size: 549, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:20:47,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1361772.0, ans=0.0 2023-06-25 15:20:53,701 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.704e+02 3.900e+02 4.940e+02 7.289e+02 1.559e+03, threshold=9.879e+02, percent-clipped=7.0 2023-06-25 15:21:22,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1361832.0, ans=0.125 2023-06-25 15:21:36,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1361892.0, ans=0.125 2023-06-25 15:21:46,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1361892.0, ans=0.125 2023-06-25 15:21:53,215 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.85 vs. limit=6.0 2023-06-25 15:22:21,723 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.31 vs. limit=15.0 2023-06-25 15:22:24,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1362012.0, ans=0.125 2023-06-25 15:22:34,508 INFO [train.py:996] (2/4) Epoch 8, batch 13550, loss[loss=0.1793, simple_loss=0.2537, pruned_loss=0.05247, over 21748.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3028, pruned_loss=0.07288, over 4266066.20 frames. ], batch size: 112, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:22:34,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1362072.0, ans=0.0 2023-06-25 15:22:38,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1362072.0, ans=0.125 2023-06-25 15:23:16,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1362132.0, ans=0.0 2023-06-25 15:23:25,813 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.62 vs. limit=6.0 2023-06-25 15:23:30,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1362192.0, ans=0.125 2023-06-25 15:23:30,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1362192.0, ans=0.2 2023-06-25 15:23:59,730 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.06 vs. limit=10.0 2023-06-25 15:24:18,232 INFO [train.py:996] (2/4) Epoch 8, batch 13600, loss[loss=0.198, simple_loss=0.2906, pruned_loss=0.05267, over 21826.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3029, pruned_loss=0.07285, over 4280943.75 frames. 
], batch size: 124, lr: 3.74e-03, grad_scale: 32.0 2023-06-25 15:24:28,505 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.890e+02 3.859e+02 5.232e+02 7.287e+02 1.567e+03, threshold=1.046e+03, percent-clipped=12.0 2023-06-25 15:25:00,110 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.31 vs. limit=15.0 2023-06-25 15:25:29,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1362552.0, ans=0.125 2023-06-25 15:25:40,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1362552.0, ans=0.1 2023-06-25 15:25:42,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1362612.0, ans=10.0 2023-06-25 15:26:01,167 INFO [train.py:996] (2/4) Epoch 8, batch 13650, loss[loss=0.1686, simple_loss=0.2372, pruned_loss=0.05001, over 21564.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2971, pruned_loss=0.07023, over 4276320.68 frames. ], batch size: 263, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:27:50,074 INFO [train.py:996] (2/4) Epoch 8, batch 13700, loss[loss=0.2025, simple_loss=0.2705, pruned_loss=0.06718, over 21668.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2923, pruned_loss=0.06958, over 4262124.02 frames. ], batch size: 263, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:28:08,788 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.733e+02 3.641e+02 4.705e+02 7.070e+02 1.116e+03, threshold=9.410e+02, percent-clipped=4.0 2023-06-25 15:29:07,519 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=12.0 2023-06-25 15:29:46,969 INFO [train.py:996] (2/4) Epoch 8, batch 13750, loss[loss=0.2174, simple_loss=0.3, pruned_loss=0.06735, over 21681.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2927, pruned_loss=0.07005, over 4266047.39 frames. ], batch size: 351, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:29:55,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1363272.0, ans=0.0 2023-06-25 15:30:31,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1363332.0, ans=0.125 2023-06-25 15:31:30,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1363512.0, ans=0.0 2023-06-25 15:31:42,860 INFO [train.py:996] (2/4) Epoch 8, batch 13800, loss[loss=0.2117, simple_loss=0.2985, pruned_loss=0.06251, over 21447.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2973, pruned_loss=0.06972, over 4262333.16 frames. ], batch size: 194, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:31:59,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1363572.0, ans=0.125 2023-06-25 15:32:00,802 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.906e+02 4.517e+02 6.756e+02 9.995e+02 2.111e+03, threshold=1.351e+03, percent-clipped=26.0 2023-06-25 15:32:04,258 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.75 vs. 
limit=15.0 2023-06-25 15:32:31,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1363692.0, ans=0.0 2023-06-25 15:32:57,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1363752.0, ans=0.125 2023-06-25 15:33:04,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1363812.0, ans=0.125 2023-06-25 15:33:33,392 INFO [train.py:996] (2/4) Epoch 8, batch 13850, loss[loss=0.2211, simple_loss=0.3155, pruned_loss=0.06334, over 21675.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3034, pruned_loss=0.07089, over 4264136.07 frames. ], batch size: 263, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:33:46,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1363872.0, ans=0.125 2023-06-25 15:34:00,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1363932.0, ans=0.04949747468305833 2023-06-25 15:34:07,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1363992.0, ans=0.0 2023-06-25 15:34:10,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1363992.0, ans=0.0 2023-06-25 15:35:20,939 INFO [train.py:996] (2/4) Epoch 8, batch 13900, loss[loss=0.2262, simple_loss=0.3098, pruned_loss=0.07131, over 21974.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3075, pruned_loss=0.07374, over 4271816.85 frames. ], batch size: 113, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:35:22,284 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.30 vs. limit=22.5 2023-06-25 15:35:32,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1364172.0, ans=0.0 2023-06-25 15:35:33,174 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.067e+02 4.054e+02 4.959e+02 6.399e+02 1.364e+03, threshold=9.918e+02, percent-clipped=1.0 2023-06-25 15:35:49,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1364232.0, ans=0.125 2023-06-25 15:35:53,719 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.47 vs. limit=15.0 2023-06-25 15:36:00,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1364292.0, ans=0.125 2023-06-25 15:36:54,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1364412.0, ans=0.125 2023-06-25 15:37:09,469 INFO [train.py:996] (2/4) Epoch 8, batch 13950, loss[loss=0.2219, simple_loss=0.3502, pruned_loss=0.04683, over 19752.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3062, pruned_loss=0.07448, over 4271448.08 frames. 
], batch size: 702, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:37:29,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1364532.0, ans=0.125 2023-06-25 15:38:57,953 INFO [train.py:996] (2/4) Epoch 8, batch 14000, loss[loss=0.2326, simple_loss=0.3218, pruned_loss=0.07166, over 21559.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3007, pruned_loss=0.07166, over 4258541.67 frames. ], batch size: 471, lr: 3.74e-03, grad_scale: 32.0 2023-06-25 15:39:09,933 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.689e+02 3.751e+02 4.894e+02 7.186e+02 1.368e+03, threshold=9.787e+02, percent-clipped=13.0 2023-06-25 15:39:14,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1364832.0, ans=0.2 2023-06-25 15:39:17,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1364832.0, ans=0.125 2023-06-25 15:39:34,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1364892.0, ans=0.125 2023-06-25 15:40:24,507 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.24 vs. limit=6.0 2023-06-25 15:40:25,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=1365012.0, ans=0.2 2023-06-25 15:40:45,640 INFO [train.py:996] (2/4) Epoch 8, batch 14050, loss[loss=0.2276, simple_loss=0.2847, pruned_loss=0.08527, over 20176.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2958, pruned_loss=0.06841, over 4250320.84 frames. ], batch size: 703, lr: 3.74e-03, grad_scale: 32.0 2023-06-25 15:40:51,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1365072.0, ans=0.125 2023-06-25 15:42:00,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1365252.0, ans=0.0 2023-06-25 15:42:05,843 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:42:33,618 INFO [train.py:996] (2/4) Epoch 8, batch 14100, loss[loss=0.2097, simple_loss=0.2806, pruned_loss=0.06934, over 21576.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2908, pruned_loss=0.06886, over 4245296.53 frames. 
], batch size: 263, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:42:39,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1365372.0, ans=0.0 2023-06-25 15:42:47,585 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.656e+02 3.476e+02 4.443e+02 5.620e+02 1.211e+03, threshold=8.886e+02, percent-clipped=2.0 2023-06-25 15:43:15,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1365492.0, ans=0.125 2023-06-25 15:43:17,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1365492.0, ans=0.125 2023-06-25 15:43:57,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1365612.0, ans=0.125 2023-06-25 15:44:19,903 INFO [train.py:996] (2/4) Epoch 8, batch 14150, loss[loss=0.2101, simple_loss=0.3043, pruned_loss=0.05792, over 21714.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2953, pruned_loss=0.06977, over 4251627.42 frames. ], batch size: 298, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:45:25,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1365852.0, ans=0.2 2023-06-25 15:45:36,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1365852.0, ans=0.125 2023-06-25 15:46:01,163 INFO [train.py:996] (2/4) Epoch 8, batch 14200, loss[loss=0.2129, simple_loss=0.2925, pruned_loss=0.06665, over 21677.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2948, pruned_loss=0.06876, over 4243467.83 frames. ], batch size: 230, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:46:20,191 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.681e+02 4.879e+02 7.691e+02 1.070e+03 2.190e+03, threshold=1.538e+03, percent-clipped=38.0 2023-06-25 15:46:34,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1366032.0, ans=0.125 2023-06-25 15:46:38,939 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. limit=10.0 2023-06-25 15:47:31,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1366212.0, ans=0.0 2023-06-25 15:47:44,719 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.17 vs. limit=12.0 2023-06-25 15:47:46,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1366212.0, ans=0.125 2023-06-25 15:47:49,401 INFO [train.py:996] (2/4) Epoch 8, batch 14250, loss[loss=0.2081, simple_loss=0.2805, pruned_loss=0.06786, over 21870.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2909, pruned_loss=0.0688, over 4251911.30 frames. ], batch size: 98, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:48:11,849 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.14 vs. 
limit=15.0 2023-06-25 15:48:16,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1366332.0, ans=0.0 2023-06-25 15:48:30,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1366392.0, ans=0.0 2023-06-25 15:49:05,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1366452.0, ans=0.125 2023-06-25 15:49:20,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1366512.0, ans=0.0 2023-06-25 15:49:20,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1366512.0, ans=0.1 2023-06-25 15:49:29,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1366512.0, ans=0.125 2023-06-25 15:49:33,992 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.90 vs. limit=22.5 2023-06-25 15:49:39,579 INFO [train.py:996] (2/4) Epoch 8, batch 14300, loss[loss=0.3056, simple_loss=0.4124, pruned_loss=0.09938, over 21228.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2938, pruned_loss=0.06926, over 4246613.48 frames. ], batch size: 549, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:49:59,652 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.539e+02 3.382e+02 4.720e+02 7.552e+02 1.673e+03, threshold=9.439e+02, percent-clipped=2.0 2023-06-25 15:50:24,027 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.41 vs. limit=15.0 2023-06-25 15:50:44,186 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:51:23,213 INFO [train.py:996] (2/4) Epoch 8, batch 14350, loss[loss=0.2093, simple_loss=0.2883, pruned_loss=0.06517, over 21874.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2999, pruned_loss=0.0698, over 4240654.77 frames. ], batch size: 371, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:52:24,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1366992.0, ans=0.125 2023-06-25 15:52:58,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1367112.0, ans=0.07 2023-06-25 15:53:07,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1367112.0, ans=0.125 2023-06-25 15:53:10,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1367172.0, ans=0.125 2023-06-25 15:53:17,520 INFO [train.py:996] (2/4) Epoch 8, batch 14400, loss[loss=0.1919, simple_loss=0.2646, pruned_loss=0.05954, over 21773.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.297, pruned_loss=0.07021, over 4236740.50 frames. 
], batch size: 316, lr: 3.74e-03, grad_scale: 32.0 2023-06-25 15:53:24,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1367172.0, ans=0.0 2023-06-25 15:53:30,794 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.862e+02 3.850e+02 4.891e+02 6.324e+02 1.594e+03, threshold=9.783e+02, percent-clipped=6.0 2023-06-25 15:54:19,155 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=22.5 2023-06-25 15:54:39,826 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=22.5 2023-06-25 15:54:41,814 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. limit=6.0 2023-06-25 15:54:53,564 INFO [train.py:996] (2/4) Epoch 8, batch 14450, loss[loss=0.1954, simple_loss=0.2631, pruned_loss=0.0638, over 21764.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2908, pruned_loss=0.06994, over 4238616.34 frames. ], batch size: 316, lr: 3.74e-03, grad_scale: 32.0 2023-06-25 15:56:08,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1367652.0, ans=0.0 2023-06-25 15:56:40,148 INFO [train.py:996] (2/4) Epoch 8, batch 14500, loss[loss=0.2013, simple_loss=0.2793, pruned_loss=0.06168, over 21170.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2861, pruned_loss=0.06949, over 4249001.14 frames. ], batch size: 143, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:56:40,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1367772.0, ans=0.125 2023-06-25 15:56:40,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1367772.0, ans=0.125 2023-06-25 15:56:42,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1367772.0, ans=0.125 2023-06-25 15:57:02,149 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.716e+02 3.452e+02 4.183e+02 6.174e+02 1.088e+03, threshold=8.366e+02, percent-clipped=1.0 2023-06-25 15:57:33,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1367892.0, ans=0.125 2023-06-25 15:58:11,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1368012.0, ans=0.125 2023-06-25 15:58:33,916 INFO [train.py:996] (2/4) Epoch 8, batch 14550, loss[loss=0.248, simple_loss=0.3262, pruned_loss=0.08494, over 21721.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2896, pruned_loss=0.07086, over 4255632.59 frames. 
], batch size: 298, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 15:59:11,484 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:59:28,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1368192.0, ans=0.2 2023-06-25 15:59:34,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1368192.0, ans=0.0 2023-06-25 15:59:38,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1368192.0, ans=0.0 2023-06-25 16:00:22,837 INFO [train.py:996] (2/4) Epoch 8, batch 14600, loss[loss=0.2293, simple_loss=0.3215, pruned_loss=0.06853, over 21734.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2971, pruned_loss=0.07413, over 4260257.58 frames. ], batch size: 247, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:00:38,185 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.261e+02 4.717e+02 6.049e+02 8.556e+02 1.756e+03, threshold=1.210e+03, percent-clipped=27.0 2023-06-25 16:01:03,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1368492.0, ans=0.125 2023-06-25 16:01:59,698 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.75 vs. limit=22.5 2023-06-25 16:02:10,910 INFO [train.py:996] (2/4) Epoch 8, batch 14650, loss[loss=0.2108, simple_loss=0.2911, pruned_loss=0.06522, over 21757.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3034, pruned_loss=0.07508, over 4260263.52 frames. ], batch size: 112, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:03:13,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1368792.0, ans=0.2 2023-06-25 16:03:20,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1368852.0, ans=0.125 2023-06-25 16:03:58,479 INFO [train.py:996] (2/4) Epoch 8, batch 14700, loss[loss=0.1939, simple_loss=0.2833, pruned_loss=0.05223, over 21502.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2992, pruned_loss=0.06994, over 4265249.05 frames. ], batch size: 471, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:04:14,433 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.383e+02 3.677e+02 4.958e+02 7.109e+02 1.155e+03, threshold=9.917e+02, percent-clipped=0.0 2023-06-25 16:05:50,290 INFO [train.py:996] (2/4) Epoch 8, batch 14750, loss[loss=0.1884, simple_loss=0.2734, pruned_loss=0.05173, over 21384.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.303, pruned_loss=0.07173, over 4267300.30 frames. ], batch size: 194, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:06:22,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1369332.0, ans=0.125 2023-06-25 16:06:29,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1369332.0, ans=0.125 2023-06-25 16:06:35,768 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.22 vs. 
limit=12.0 2023-06-25 16:06:40,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1369332.0, ans=0.125 2023-06-25 16:07:00,615 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.80 vs. limit=22.5 2023-06-25 16:07:23,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1369512.0, ans=0.125 2023-06-25 16:07:35,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1369512.0, ans=0.125 2023-06-25 16:07:35,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1369512.0, ans=0.125 2023-06-25 16:07:47,170 INFO [train.py:996] (2/4) Epoch 8, batch 14800, loss[loss=0.3514, simple_loss=0.4007, pruned_loss=0.1511, over 21401.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3131, pruned_loss=0.07702, over 4262556.92 frames. ], batch size: 507, lr: 3.73e-03, grad_scale: 32.0 2023-06-25 16:08:12,803 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.189e+02 4.811e+02 6.847e+02 1.023e+03 2.171e+03, threshold=1.369e+03, percent-clipped=26.0 2023-06-25 16:08:31,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1369692.0, ans=0.0 2023-06-25 16:08:33,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1369692.0, ans=0.125 2023-06-25 16:09:15,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1369812.0, ans=0.125 2023-06-25 16:09:40,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1369812.0, ans=0.125 2023-06-25 16:09:43,099 INFO [train.py:996] (2/4) Epoch 8, batch 14850, loss[loss=0.1936, simple_loss=0.2645, pruned_loss=0.06137, over 21851.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3067, pruned_loss=0.07629, over 4266002.96 frames. ], batch size: 107, lr: 3.73e-03, grad_scale: 32.0 2023-06-25 16:10:19,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1369932.0, ans=0.0 2023-06-25 16:10:23,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1369992.0, ans=0.125 2023-06-25 16:11:39,839 INFO [train.py:996] (2/4) Epoch 8, batch 14900, loss[loss=0.2363, simple_loss=0.3108, pruned_loss=0.08088, over 21832.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3087, pruned_loss=0.07796, over 4270137.61 frames. ], batch size: 282, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:11:57,473 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.090e+02 4.206e+02 5.469e+02 8.347e+02 1.577e+03, threshold=1.094e+03, percent-clipped=2.0 2023-06-25 16:12:12,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1370232.0, ans=0.125 2023-06-25 16:13:30,838 INFO [train.py:996] (2/4) Epoch 8, batch 14950, loss[loss=0.2228, simple_loss=0.3154, pruned_loss=0.06508, over 21925.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3097, pruned_loss=0.07719, over 4277221.93 frames. 
], batch size: 373, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:13:57,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1370532.0, ans=0.2 2023-06-25 16:14:11,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1370592.0, ans=0.125 2023-06-25 16:14:21,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1370592.0, ans=0.2 2023-06-25 16:15:06,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1370712.0, ans=0.0 2023-06-25 16:15:11,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1370712.0, ans=0.1 2023-06-25 16:15:11,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1370712.0, ans=0.2 2023-06-25 16:15:19,935 INFO [train.py:996] (2/4) Epoch 8, batch 15000, loss[loss=0.2404, simple_loss=0.3171, pruned_loss=0.08182, over 21689.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3111, pruned_loss=0.07795, over 4279424.87 frames. ], batch size: 389, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:15:19,936 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 16:15:40,730 INFO [train.py:1028] (2/4) Epoch 8, validation: loss=0.2554, simple_loss=0.3473, pruned_loss=0.08173, over 1796401.00 frames. 2023-06-25 16:15:40,732 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-25 16:15:49,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1370772.0, ans=0.125 2023-06-25 16:15:54,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1370772.0, ans=0.1 2023-06-25 16:15:58,827 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.944e+02 3.850e+02 4.977e+02 6.696e+02 1.113e+03, threshold=9.953e+02, percent-clipped=2.0 2023-06-25 16:16:32,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1370892.0, ans=0.0 2023-06-25 16:16:56,345 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.37 vs. limit=15.0 2023-06-25 16:17:30,920 INFO [train.py:996] (2/4) Epoch 8, batch 15050, loss[loss=0.2227, simple_loss=0.306, pruned_loss=0.06972, over 21435.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3129, pruned_loss=0.07907, over 4281384.23 frames. ], batch size: 211, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:17:42,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1371072.0, ans=0.1 2023-06-25 16:18:18,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1371192.0, ans=0.125 2023-06-25 16:18:19,101 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=3.80 vs. 
limit=15.0 2023-06-25 16:18:29,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1371192.0, ans=10.0 2023-06-25 16:18:40,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1371252.0, ans=0.0 2023-06-25 16:18:45,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1371252.0, ans=0.0 2023-06-25 16:18:56,825 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=15.0 2023-06-25 16:19:20,733 INFO [train.py:996] (2/4) Epoch 8, batch 15100, loss[loss=0.1961, simple_loss=0.274, pruned_loss=0.05905, over 21619.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3153, pruned_loss=0.07837, over 4283453.19 frames. ], batch size: 112, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:19:43,662 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.981e+02 4.480e+02 6.447e+02 8.808e+02 1.442e+03, threshold=1.289e+03, percent-clipped=16.0 2023-06-25 16:19:56,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1371432.0, ans=0.0 2023-06-25 16:20:49,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1371552.0, ans=0.125 2023-06-25 16:20:59,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1371612.0, ans=0.125 2023-06-25 16:21:09,586 INFO [train.py:996] (2/4) Epoch 8, batch 15150, loss[loss=0.1904, simple_loss=0.2566, pruned_loss=0.06208, over 21623.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3095, pruned_loss=0.07757, over 4283970.41 frames. ], batch size: 298, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:22:18,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1371852.0, ans=0.125 2023-06-25 16:22:57,840 INFO [train.py:996] (2/4) Epoch 8, batch 15200, loss[loss=0.2431, simple_loss=0.3152, pruned_loss=0.08548, over 21387.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3008, pruned_loss=0.07419, over 4287748.00 frames. ], batch size: 507, lr: 3.73e-03, grad_scale: 32.0 2023-06-25 16:23:25,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1372032.0, ans=0.0 2023-06-25 16:23:26,366 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.586e+02 3.888e+02 5.742e+02 8.749e+02 1.820e+03, threshold=1.148e+03, percent-clipped=6.0 2023-06-25 16:24:05,927 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.53 vs. limit=15.0 2023-06-25 16:24:52,309 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.44 vs. limit=15.0 2023-06-25 16:24:52,722 INFO [train.py:996] (2/4) Epoch 8, batch 15250, loss[loss=0.2405, simple_loss=0.2983, pruned_loss=0.09135, over 21612.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2958, pruned_loss=0.07265, over 4274763.56 frames. 
], batch size: 441, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:25:25,528 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.44 vs. limit=15.0 2023-06-25 16:25:34,946 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:26:07,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1372452.0, ans=0.125 2023-06-25 16:26:11,521 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.79 vs. limit=10.0 2023-06-25 16:26:18,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1372512.0, ans=0.0 2023-06-25 16:26:48,263 INFO [train.py:996] (2/4) Epoch 8, batch 15300, loss[loss=0.2301, simple_loss=0.3106, pruned_loss=0.07478, over 22019.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2982, pruned_loss=0.07482, over 4275599.57 frames. ], batch size: 317, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:27:11,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1372632.0, ans=0.1 2023-06-25 16:27:12,649 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.814e+02 3.962e+02 5.141e+02 6.603e+02 1.300e+03, threshold=1.028e+03, percent-clipped=5.0 2023-06-25 16:27:13,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1372632.0, ans=0.125 2023-06-25 16:27:20,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1372632.0, ans=0.125 2023-06-25 16:27:37,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1372692.0, ans=0.125 2023-06-25 16:27:52,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1372752.0, ans=0.0 2023-06-25 16:28:13,107 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.40 vs. limit=12.0 2023-06-25 16:28:30,834 INFO [train.py:996] (2/4) Epoch 8, batch 15350, loss[loss=0.2196, simple_loss=0.3134, pruned_loss=0.06291, over 21870.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.305, pruned_loss=0.07681, over 4279916.21 frames. ], batch size: 316, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:28:40,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1372872.0, ans=0.1 2023-06-25 16:28:56,414 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=15.0 2023-06-25 16:29:44,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1373052.0, ans=0.1 2023-06-25 16:30:13,125 INFO [train.py:996] (2/4) Epoch 8, batch 15400, loss[loss=0.2363, simple_loss=0.304, pruned_loss=0.08436, over 21893.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3064, pruned_loss=0.07562, over 4279821.71 frames. 
], batch size: 371, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:30:46,929 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.892e+02 4.124e+02 5.602e+02 8.412e+02 1.592e+03, threshold=1.120e+03, percent-clipped=11.0 2023-06-25 16:30:50,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1373232.0, ans=0.125 2023-06-25 16:30:59,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1373232.0, ans=0.05 2023-06-25 16:31:31,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1373352.0, ans=0.125 2023-06-25 16:32:01,297 INFO [train.py:996] (2/4) Epoch 8, batch 15450, loss[loss=0.1991, simple_loss=0.2742, pruned_loss=0.06206, over 21641.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.304, pruned_loss=0.0752, over 4267537.07 frames. ], batch size: 263, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:32:30,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1373472.0, ans=0.125 2023-06-25 16:32:32,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1373472.0, ans=0.1 2023-06-25 16:32:39,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1373532.0, ans=0.125 2023-06-25 16:33:03,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1373592.0, ans=0.0 2023-06-25 16:33:20,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1373652.0, ans=0.125 2023-06-25 16:34:02,545 INFO [train.py:996] (2/4) Epoch 8, batch 15500, loss[loss=0.242, simple_loss=0.3252, pruned_loss=0.07945, over 21893.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3056, pruned_loss=0.07484, over 4261498.33 frames. ], batch size: 371, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:34:27,243 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.729e+02 3.957e+02 5.678e+02 7.705e+02 1.506e+03, threshold=1.136e+03, percent-clipped=3.0 2023-06-25 16:34:57,981 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.89 vs. limit=6.0 2023-06-25 16:35:02,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1373892.0, ans=0.125 2023-06-25 16:35:02,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1373892.0, ans=0.0 2023-06-25 16:35:34,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1374012.0, ans=0.2 2023-06-25 16:35:50,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1374012.0, ans=0.035 2023-06-25 16:35:53,647 INFO [train.py:996] (2/4) Epoch 8, batch 15550, loss[loss=0.1743, simple_loss=0.247, pruned_loss=0.05082, over 21843.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3029, pruned_loss=0.07262, over 4271284.07 frames. 
], batch size: 107, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:35:55,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1374072.0, ans=0.0 2023-06-25 16:37:42,357 INFO [train.py:996] (2/4) Epoch 8, batch 15600, loss[loss=0.2346, simple_loss=0.3046, pruned_loss=0.0823, over 21503.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2967, pruned_loss=0.07133, over 4270271.32 frames. ], batch size: 441, lr: 3.73e-03, grad_scale: 32.0 2023-06-25 16:37:55,194 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.05 vs. limit=6.0 2023-06-25 16:38:01,280 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.645e+02 3.371e+02 3.943e+02 5.908e+02 1.274e+03, threshold=7.887e+02, percent-clipped=2.0 2023-06-25 16:38:05,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1374432.0, ans=0.1 2023-06-25 16:39:27,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1374612.0, ans=0.1 2023-06-25 16:39:30,803 INFO [train.py:996] (2/4) Epoch 8, batch 15650, loss[loss=0.2145, simple_loss=0.2851, pruned_loss=0.07188, over 21866.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2944, pruned_loss=0.0706, over 4270389.18 frames. ], batch size: 98, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:39:34,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1374672.0, ans=0.125 2023-06-25 16:40:18,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1374792.0, ans=0.0 2023-06-25 16:40:26,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1374792.0, ans=0.125 2023-06-25 16:41:19,311 INFO [train.py:996] (2/4) Epoch 8, batch 15700, loss[loss=0.1937, simple_loss=0.2629, pruned_loss=0.06221, over 21211.00 frames. ], tot_loss[loss=0.215, simple_loss=0.291, pruned_loss=0.06951, over 4253001.74 frames. ], batch size: 159, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:41:37,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1375032.0, ans=0.125 2023-06-25 16:41:40,402 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.691e+02 3.513e+02 4.156e+02 5.605e+02 1.068e+03, threshold=8.312e+02, percent-clipped=8.0 2023-06-25 16:41:59,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=1375092.0, ans=0.1 2023-06-25 16:42:35,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1375152.0, ans=0.125 2023-06-25 16:43:06,899 INFO [train.py:996] (2/4) Epoch 8, batch 15750, loss[loss=0.2041, simple_loss=0.2762, pruned_loss=0.06596, over 21670.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2863, pruned_loss=0.06905, over 4251668.93 frames. 
], batch size: 298, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:43:07,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1375272.0, ans=0.125 2023-06-25 16:44:34,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1375512.0, ans=0.0 2023-06-25 16:44:55,727 INFO [train.py:996] (2/4) Epoch 8, batch 15800, loss[loss=0.2151, simple_loss=0.283, pruned_loss=0.07357, over 21292.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2827, pruned_loss=0.0688, over 4260342.06 frames. ], batch size: 548, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:45:16,668 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.881e+02 4.132e+02 5.788e+02 8.606e+02 2.042e+03, threshold=1.158e+03, percent-clipped=26.0 2023-06-25 16:45:18,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1375632.0, ans=0.1 2023-06-25 16:45:37,820 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.24 vs. limit=22.5 2023-06-25 16:46:10,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1375752.0, ans=0.125 2023-06-25 16:46:44,088 INFO [train.py:996] (2/4) Epoch 8, batch 15850, loss[loss=0.215, simple_loss=0.2878, pruned_loss=0.07111, over 21688.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.286, pruned_loss=0.07134, over 4263382.96 frames. ], batch size: 247, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:47:38,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1375992.0, ans=0.2 2023-06-25 16:48:32,035 INFO [train.py:996] (2/4) Epoch 8, batch 15900, loss[loss=0.2017, simple_loss=0.2914, pruned_loss=0.05597, over 21530.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2853, pruned_loss=0.07152, over 4267348.90 frames. ], batch size: 230, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:48:52,537 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.061e+02 4.407e+02 5.744e+02 8.356e+02 1.559e+03, threshold=1.149e+03, percent-clipped=5.0 2023-06-25 16:49:14,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1376292.0, ans=0.0 2023-06-25 16:49:27,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1376352.0, ans=0.125 2023-06-25 16:50:06,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1376412.0, ans=0.1 2023-06-25 16:50:13,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1376412.0, ans=0.2 2023-06-25 16:50:19,048 INFO [train.py:996] (2/4) Epoch 8, batch 15950, loss[loss=0.2439, simple_loss=0.323, pruned_loss=0.08243, over 21662.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.286, pruned_loss=0.06858, over 4258985.37 frames. ], batch size: 441, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:52:04,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=1376712.0, ans=0.02 2023-06-25 16:52:10,850 INFO [train.py:996] (2/4) Epoch 8, batch 16000, loss[loss=0.2193, simple_loss=0.3149, pruned_loss=0.06185, over 21286.00 frames. 
], tot_loss[loss=0.2115, simple_loss=0.2878, pruned_loss=0.06757, over 4256539.77 frames. ], batch size: 548, lr: 3.72e-03, grad_scale: 32.0 2023-06-25 16:52:16,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1376772.0, ans=0.125 2023-06-25 16:52:29,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1376832.0, ans=0.125 2023-06-25 16:52:31,817 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.505e+02 3.799e+02 4.877e+02 8.252e+02 1.708e+03, threshold=9.755e+02, percent-clipped=5.0 2023-06-25 16:52:34,739 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-25 16:52:36,580 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=15.0 2023-06-25 16:52:45,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1376892.0, ans=0.125 2023-06-25 16:53:03,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1376892.0, ans=0.0 2023-06-25 16:53:19,132 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.11 vs. limit=22.5 2023-06-25 16:53:51,863 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0 2023-06-25 16:53:59,512 INFO [train.py:996] (2/4) Epoch 8, batch 16050, loss[loss=0.2028, simple_loss=0.2902, pruned_loss=0.05775, over 21639.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2894, pruned_loss=0.06572, over 4267810.21 frames. ], batch size: 230, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:54:00,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1377072.0, ans=0.125 2023-06-25 16:54:03,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1377072.0, ans=0.025 2023-06-25 16:55:42,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1377312.0, ans=0.1 2023-06-25 16:55:47,397 INFO [train.py:996] (2/4) Epoch 8, batch 16100, loss[loss=0.222, simple_loss=0.2985, pruned_loss=0.07268, over 21850.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2942, pruned_loss=0.06739, over 4269654.42 frames. 
], batch size: 332, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:56:10,219 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.825e+02 4.315e+02 5.631e+02 9.006e+02 2.276e+03, threshold=1.126e+03, percent-clipped=22.0 2023-06-25 16:56:11,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1377432.0, ans=0.0 2023-06-25 16:56:31,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1377492.0, ans=0.07 2023-06-25 16:57:20,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1377612.0, ans=0.09899494936611666 2023-06-25 16:57:35,032 INFO [train.py:996] (2/4) Epoch 8, batch 16150, loss[loss=0.2324, simple_loss=0.3082, pruned_loss=0.07831, over 21483.00 frames. ], tot_loss[loss=0.217, simple_loss=0.294, pruned_loss=0.06998, over 4278892.60 frames. ], batch size: 548, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:57:51,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1377732.0, ans=0.0 2023-06-25 16:58:52,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1377852.0, ans=0.1 2023-06-25 16:59:23,927 INFO [train.py:996] (2/4) Epoch 8, batch 16200, loss[loss=0.2553, simple_loss=0.3311, pruned_loss=0.08972, over 21497.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2982, pruned_loss=0.0713, over 4283226.71 frames. ], batch size: 194, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:59:39,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1378032.0, ans=0.125 2023-06-25 16:59:46,146 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.109e+02 4.018e+02 5.082e+02 7.447e+02 1.479e+03, threshold=1.016e+03, percent-clipped=6.0 2023-06-25 17:00:17,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1378092.0, ans=15.0 2023-06-25 17:01:04,317 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.79 vs. limit=22.5 2023-06-25 17:01:11,824 INFO [train.py:996] (2/4) Epoch 8, batch 16250, loss[loss=0.1889, simple_loss=0.2668, pruned_loss=0.05547, over 21316.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2981, pruned_loss=0.07133, over 4284340.55 frames. ], batch size: 176, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:01:16,585 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.69 vs. 
limit=12.0 2023-06-25 17:01:24,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1378272.0, ans=10.0 2023-06-25 17:01:30,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1378332.0, ans=0.0 2023-06-25 17:01:37,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1378332.0, ans=0.1 2023-06-25 17:01:39,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1378332.0, ans=0.0 2023-06-25 17:02:03,859 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=22.5 2023-06-25 17:03:00,463 INFO [train.py:996] (2/4) Epoch 8, batch 16300, loss[loss=0.2214, simple_loss=0.3021, pruned_loss=0.07029, over 21353.00 frames. ], tot_loss[loss=0.214, simple_loss=0.293, pruned_loss=0.06746, over 4276808.87 frames. ], batch size: 471, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:03:03,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1378572.0, ans=0.125 2023-06-25 17:03:24,191 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.309e+02 4.494e+02 6.869e+02 1.781e+03, threshold=8.988e+02, percent-clipped=11.0 2023-06-25 17:03:41,538 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.01 vs. limit=10.0 2023-06-25 17:04:19,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1378752.0, ans=0.125 2023-06-25 17:04:50,594 INFO [train.py:996] (2/4) Epoch 8, batch 16350, loss[loss=0.266, simple_loss=0.3497, pruned_loss=0.09117, over 21795.00 frames. ], tot_loss[loss=0.214, simple_loss=0.293, pruned_loss=0.06749, over 4273392.05 frames. ], batch size: 124, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:05:06,352 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.75 vs. limit=15.0 2023-06-25 17:05:23,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1378932.0, ans=0.04949747468305833 2023-06-25 17:06:24,296 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.81 vs. limit=12.0 2023-06-25 17:06:39,454 INFO [train.py:996] (2/4) Epoch 8, batch 16400, loss[loss=0.2037, simple_loss=0.2798, pruned_loss=0.06377, over 21848.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2975, pruned_loss=0.06957, over 4277190.20 frames. 
], batch size: 298, lr: 3.72e-03, grad_scale: 32.0 2023-06-25 17:06:39,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1379172.0, ans=0.2 2023-06-25 17:06:50,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1379172.0, ans=0.1 2023-06-25 17:07:06,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1379232.0, ans=0.2 2023-06-25 17:07:08,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1379232.0, ans=0.0 2023-06-25 17:07:09,119 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.039e+02 4.348e+02 5.266e+02 7.750e+02 2.110e+03, threshold=1.053e+03, percent-clipped=17.0 2023-06-25 17:07:46,723 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.17 vs. limit=15.0 2023-06-25 17:08:09,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1379412.0, ans=0.0 2023-06-25 17:08:22,758 INFO [train.py:996] (2/4) Epoch 8, batch 16450, loss[loss=0.2139, simple_loss=0.2936, pruned_loss=0.06708, over 21876.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2977, pruned_loss=0.07117, over 4288482.83 frames. ], batch size: 351, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:08:36,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1379472.0, ans=0.2 2023-06-25 17:08:54,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1379532.0, ans=0.125 2023-06-25 17:09:01,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1379532.0, ans=0.2 2023-06-25 17:09:25,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1379592.0, ans=0.125 2023-06-25 17:10:12,899 INFO [train.py:996] (2/4) Epoch 8, batch 16500, loss[loss=0.2458, simple_loss=0.3233, pruned_loss=0.08416, over 21725.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2969, pruned_loss=0.07186, over 4286206.17 frames. ], batch size: 441, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:10:19,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1379772.0, ans=0.2 2023-06-25 17:10:21,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1379772.0, ans=0.0 2023-06-25 17:10:43,405 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.095e+02 4.406e+02 5.923e+02 9.341e+02 2.012e+03, threshold=1.185e+03, percent-clipped=18.0 2023-06-25 17:12:03,237 INFO [train.py:996] (2/4) Epoch 8, batch 16550, loss[loss=0.2258, simple_loss=0.3224, pruned_loss=0.06459, over 21271.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2961, pruned_loss=0.06977, over 4277621.17 frames. 
], batch size: 548, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:12:29,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=1380072.0, ans=0.02 2023-06-25 17:12:43,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1380132.0, ans=0.1 2023-06-25 17:13:12,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1380192.0, ans=0.0 2023-06-25 17:14:05,621 INFO [train.py:996] (2/4) Epoch 8, batch 16600, loss[loss=0.2562, simple_loss=0.3427, pruned_loss=0.08489, over 21255.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3019, pruned_loss=0.07172, over 4280704.61 frames. ], batch size: 159, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:14:08,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1380372.0, ans=0.0 2023-06-25 17:14:17,261 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-25 17:14:20,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1380372.0, ans=0.025 2023-06-25 17:14:41,013 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.056e+02 4.921e+02 6.632e+02 9.394e+02 2.372e+03, threshold=1.326e+03, percent-clipped=11.0 2023-06-25 17:15:07,129 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.86 vs. limit=15.0 2023-06-25 17:15:23,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1380552.0, ans=0.1 2023-06-25 17:15:26,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1380552.0, ans=0.0 2023-06-25 17:16:01,866 INFO [train.py:996] (2/4) Epoch 8, batch 16650, loss[loss=0.251, simple_loss=0.3286, pruned_loss=0.08665, over 21314.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3112, pruned_loss=0.07442, over 4279983.61 frames. ], batch size: 176, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:16:14,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1380672.0, ans=0.0 2023-06-25 17:16:33,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1380732.0, ans=0.125 2023-06-25 17:16:55,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1380792.0, ans=0.2 2023-06-25 17:17:21,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1380852.0, ans=0.125 2023-06-25 17:17:39,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1380912.0, ans=0.025 2023-06-25 17:17:59,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1380972.0, ans=0.2 2023-06-25 17:18:00,234 INFO [train.py:996] (2/4) Epoch 8, batch 16700, loss[loss=0.1813, simple_loss=0.2466, pruned_loss=0.05799, over 21404.00 frames. 
], tot_loss[loss=0.2329, simple_loss=0.3135, pruned_loss=0.07615, over 4284853.11 frames. ], batch size: 194, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:18:19,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1381032.0, ans=0.125 2023-06-25 17:18:21,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1381032.0, ans=0.125 2023-06-25 17:18:26,043 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.158e+02 5.069e+02 7.220e+02 1.088e+03 2.234e+03, threshold=1.444e+03, percent-clipped=12.0 2023-06-25 17:19:04,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1381092.0, ans=0.125 2023-06-25 17:19:13,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1381152.0, ans=0.125 2023-06-25 17:19:26,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1381152.0, ans=0.0 2023-06-25 17:19:49,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1381212.0, ans=0.04949747468305833 2023-06-25 17:19:54,804 INFO [train.py:996] (2/4) Epoch 8, batch 16750, loss[loss=0.2056, simple_loss=0.3214, pruned_loss=0.04487, over 20812.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3155, pruned_loss=0.07805, over 4276487.56 frames. ], batch size: 608, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:20:53,796 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:20:55,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1381392.0, ans=0.2 2023-06-25 17:21:06,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1381452.0, ans=0.2 2023-06-25 17:21:11,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1381452.0, ans=0.5 2023-06-25 17:21:11,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1381452.0, ans=0.0 2023-06-25 17:21:30,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1381512.0, ans=10.0 2023-06-25 17:21:47,585 INFO [train.py:996] (2/4) Epoch 8, batch 16800, loss[loss=0.2064, simple_loss=0.2802, pruned_loss=0.06628, over 21673.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3175, pruned_loss=0.07725, over 4272092.95 frames. 
], batch size: 263, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:22:18,698 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.357e+02 4.342e+02 5.532e+02 7.799e+02 1.934e+03, threshold=1.106e+03, percent-clipped=5.0 2023-06-25 17:22:33,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1381692.0, ans=0.1 2023-06-25 17:22:33,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1381692.0, ans=0.125 2023-06-25 17:22:35,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1381692.0, ans=15.0 2023-06-25 17:22:51,542 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2023-06-25 17:23:13,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1381812.0, ans=0.125 2023-06-25 17:23:24,358 INFO [train.py:996] (2/4) Epoch 8, batch 16850, loss[loss=0.1995, simple_loss=0.2945, pruned_loss=0.05222, over 18304.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.313, pruned_loss=0.07757, over 4274812.36 frames. ], batch size: 63, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:23:35,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1381872.0, ans=0.2 2023-06-25 17:23:41,428 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.79 vs. limit=10.0 2023-06-25 17:24:27,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1381992.0, ans=0.125 2023-06-25 17:24:30,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1381992.0, ans=0.0 2023-06-25 17:24:32,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1381992.0, ans=0.125 2023-06-25 17:24:33,136 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2023-06-25 17:24:38,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1382052.0, ans=0.125 2023-06-25 17:24:57,297 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=12.0 2023-06-25 17:25:11,474 INFO [train.py:996] (2/4) Epoch 8, batch 16900, loss[loss=0.2062, simple_loss=0.2751, pruned_loss=0.06865, over 21279.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.307, pruned_loss=0.0752, over 4284409.42 frames. 
], batch size: 144, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:25:25,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1382172.0, ans=0.125 2023-06-25 17:25:58,355 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.598e+02 4.085e+02 5.568e+02 7.476e+02 1.428e+03, threshold=1.114e+03, percent-clipped=3.0 2023-06-25 17:26:07,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=1382292.0, ans=0.05 2023-06-25 17:26:09,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1382292.0, ans=0.1 2023-06-25 17:26:54,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1382412.0, ans=0.125 2023-06-25 17:26:59,149 INFO [train.py:996] (2/4) Epoch 8, batch 16950, loss[loss=0.2031, simple_loss=0.2941, pruned_loss=0.0561, over 19958.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3004, pruned_loss=0.07343, over 4278988.56 frames. ], batch size: 703, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:28:02,695 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.70 vs. limit=15.0 2023-06-25 17:28:23,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1382652.0, ans=0.125 2023-06-25 17:28:27,585 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.18 vs. limit=6.0 2023-06-25 17:28:53,783 INFO [train.py:996] (2/4) Epoch 8, batch 17000, loss[loss=0.2058, simple_loss=0.2992, pruned_loss=0.05619, over 21065.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2997, pruned_loss=0.07333, over 4278496.00 frames. ], batch size: 607, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:28:54,417 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:29:35,996 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.768e+02 4.400e+02 6.237e+02 1.054e+03 1.925e+03, threshold=1.247e+03, percent-clipped=22.0 2023-06-25 17:29:55,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1382892.0, ans=0.125 2023-06-25 17:30:29,752 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-25 17:30:36,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1383012.0, ans=0.1 2023-06-25 17:30:47,374 INFO [train.py:996] (2/4) Epoch 8, batch 17050, loss[loss=0.2475, simple_loss=0.3314, pruned_loss=0.08184, over 21779.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3056, pruned_loss=0.07549, over 4278786.85 frames. ], batch size: 414, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:31:43,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1383252.0, ans=0.125 2023-06-25 17:32:21,216 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.65 vs. 
limit=12.0 2023-06-25 17:32:29,757 INFO [train.py:996] (2/4) Epoch 8, batch 17100, loss[loss=0.2015, simple_loss=0.2637, pruned_loss=0.06961, over 21497.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3058, pruned_loss=0.07678, over 4283326.50 frames. ], batch size: 212, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:33:06,776 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.094e+02 4.538e+02 6.730e+02 8.383e+02 1.322e+03, threshold=1.346e+03, percent-clipped=2.0 2023-06-25 17:33:16,800 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.69 vs. limit=22.5 2023-06-25 17:33:56,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1383612.0, ans=0.125 2023-06-25 17:34:20,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1383612.0, ans=0.025 2023-06-25 17:34:23,425 INFO [train.py:996] (2/4) Epoch 8, batch 17150, loss[loss=0.1816, simple_loss=0.2559, pruned_loss=0.05364, over 21348.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3015, pruned_loss=0.07613, over 4284476.04 frames. ], batch size: 159, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:34:57,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1383732.0, ans=0.2 2023-06-25 17:35:13,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1383792.0, ans=0.125 2023-06-25 17:35:20,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1383852.0, ans=0.125 2023-06-25 17:35:40,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1383852.0, ans=0.1 2023-06-25 17:36:18,574 INFO [train.py:996] (2/4) Epoch 8, batch 17200, loss[loss=0.2455, simple_loss=0.3197, pruned_loss=0.08561, over 21543.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3011, pruned_loss=0.07522, over 4281097.53 frames. ], batch size: 389, lr: 3.71e-03, grad_scale: 32.0 2023-06-25 17:36:19,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1383972.0, ans=0.1 2023-06-25 17:36:31,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1383972.0, ans=0.0 2023-06-25 17:36:38,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1384032.0, ans=0.125 2023-06-25 17:36:44,520 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.991e+02 4.225e+02 5.384e+02 7.580e+02 1.533e+03, threshold=1.077e+03, percent-clipped=1.0 2023-06-25 17:37:12,206 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:38:07,342 INFO [train.py:996] (2/4) Epoch 8, batch 17250, loss[loss=0.2464, simple_loss=0.3344, pruned_loss=0.07917, over 21724.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3042, pruned_loss=0.07691, over 4284367.17 frames. 
], batch size: 124, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:38:36,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1384332.0, ans=0.1 2023-06-25 17:39:20,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1384452.0, ans=0.0 2023-06-25 17:39:36,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1384452.0, ans=0.0 2023-06-25 17:39:48,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1384512.0, ans=0.05 2023-06-25 17:39:50,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1384512.0, ans=0.125 2023-06-25 17:39:57,100 INFO [train.py:996] (2/4) Epoch 8, batch 17300, loss[loss=0.1825, simple_loss=0.2213, pruned_loss=0.07182, over 20086.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3106, pruned_loss=0.07986, over 4281800.97 frames. ], batch size: 705, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:40:11,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1384572.0, ans=0.0 2023-06-25 17:40:25,193 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.394e+02 4.465e+02 6.350e+02 1.043e+03 2.141e+03, threshold=1.270e+03, percent-clipped=19.0 2023-06-25 17:41:45,720 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.22 vs. limit=6.0 2023-06-25 17:41:47,992 INFO [train.py:996] (2/4) Epoch 8, batch 17350, loss[loss=0.1933, simple_loss=0.2766, pruned_loss=0.05501, over 21809.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3106, pruned_loss=0.07911, over 4284719.68 frames. ], batch size: 282, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:41:54,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1384872.0, ans=0.04949747468305833 2023-06-25 17:41:57,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1384872.0, ans=10.0 2023-06-25 17:42:48,671 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:43:04,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1385052.0, ans=0.125 2023-06-25 17:43:38,078 INFO [train.py:996] (2/4) Epoch 8, batch 17400, loss[loss=0.2332, simple_loss=0.3076, pruned_loss=0.07937, over 20665.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3082, pruned_loss=0.07621, over 4286271.96 frames. ], batch size: 607, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:43:41,174 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.58 vs. 
limit=22.5 2023-06-25 17:44:22,055 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.624e+02 3.735e+02 4.973e+02 6.706e+02 2.674e+03, threshold=9.946e+02, percent-clipped=3.0 2023-06-25 17:44:57,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1385352.0, ans=0.125 2023-06-25 17:45:07,446 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=22.5 2023-06-25 17:45:32,735 INFO [train.py:996] (2/4) Epoch 8, batch 17450, loss[loss=0.1714, simple_loss=0.2629, pruned_loss=0.04, over 21767.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3064, pruned_loss=0.07423, over 4282808.01 frames. ], batch size: 282, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:46:13,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1385532.0, ans=0.0 2023-06-25 17:46:52,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1385652.0, ans=0.125 2023-06-25 17:46:54,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1385652.0, ans=0.1 2023-06-25 17:47:20,346 INFO [train.py:996] (2/4) Epoch 8, batch 17500, loss[loss=0.2147, simple_loss=0.2882, pruned_loss=0.07066, over 21874.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3021, pruned_loss=0.07181, over 4280919.67 frames. ], batch size: 351, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:47:44,266 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-06-25 17:47:57,726 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.917e+02 3.737e+02 5.034e+02 7.979e+02 1.418e+03, threshold=1.007e+03, percent-clipped=12.0 2023-06-25 17:48:00,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1385832.0, ans=0.0 2023-06-25 17:48:01,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1385832.0, ans=0.1 2023-06-25 17:48:10,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1385892.0, ans=0.04949747468305833 2023-06-25 17:48:39,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1386012.0, ans=0.05 2023-06-25 17:48:39,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1386012.0, ans=0.1 2023-06-25 17:48:46,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1386012.0, ans=0.125 2023-06-25 17:49:07,125 INFO [train.py:996] (2/4) Epoch 8, batch 17550, loss[loss=0.2105, simple_loss=0.3006, pruned_loss=0.06021, over 21208.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3024, pruned_loss=0.07084, over 4270625.99 frames. ], batch size: 143, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:49:37,933 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.12 vs. 
limit=15.0 2023-06-25 17:50:54,189 INFO [train.py:996] (2/4) Epoch 8, batch 17600, loss[loss=0.2216, simple_loss=0.3122, pruned_loss=0.06554, over 21826.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3047, pruned_loss=0.07131, over 4270672.84 frames. ], batch size: 118, lr: 3.71e-03, grad_scale: 32.0 2023-06-25 17:51:29,844 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=22.5 2023-06-25 17:51:31,495 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.12 vs. limit=10.0 2023-06-25 17:51:33,998 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.904e+02 3.918e+02 5.459e+02 7.837e+02 1.902e+03, threshold=1.092e+03, percent-clipped=12.0 2023-06-25 17:52:49,584 INFO [train.py:996] (2/4) Epoch 8, batch 17650, loss[loss=0.1862, simple_loss=0.2517, pruned_loss=0.0603, over 21325.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3031, pruned_loss=0.07181, over 4274330.40 frames. ], batch size: 211, lr: 3.71e-03, grad_scale: 32.0 2023-06-25 17:53:02,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1386672.0, ans=0.125 2023-06-25 17:53:08,074 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:53:31,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1386792.0, ans=0.125 2023-06-25 17:54:37,209 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.10 vs. limit=15.0 2023-06-25 17:54:39,351 INFO [train.py:996] (2/4) Epoch 8, batch 17700, loss[loss=0.1447, simple_loss=0.2104, pruned_loss=0.0395, over 21509.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2977, pruned_loss=0.06902, over 4262575.36 frames. ], batch size: 212, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:54:40,034 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:54:41,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1386972.0, ans=10.0 2023-06-25 17:54:54,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1386972.0, ans=0.07 2023-06-25 17:54:58,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1386972.0, ans=0.125 2023-06-25 17:55:08,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1387032.0, ans=0.5 2023-06-25 17:55:14,589 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.821e+02 4.539e+02 6.208e+02 9.459e+02 1.772e+03, threshold=1.242e+03, percent-clipped=17.0 2023-06-25 17:55:23,320 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.08 vs. limit=10.0 2023-06-25 17:56:34,265 INFO [train.py:996] (2/4) Epoch 8, batch 17750, loss[loss=0.2372, simple_loss=0.3196, pruned_loss=0.07739, over 21988.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3051, pruned_loss=0.07251, over 4265220.06 frames. 
], batch size: 317, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:56:40,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1387272.0, ans=0.025 2023-06-25 17:57:00,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1387332.0, ans=0.125 2023-06-25 17:57:46,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1387452.0, ans=0.1 2023-06-25 17:57:56,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1387452.0, ans=0.0 2023-06-25 17:58:04,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1387452.0, ans=0.125 2023-06-25 17:58:04,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1387452.0, ans=0.2 2023-06-25 17:58:19,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1387512.0, ans=0.125 2023-06-25 17:58:25,863 INFO [train.py:996] (2/4) Epoch 8, batch 17800, loss[loss=0.2018, simple_loss=0.2752, pruned_loss=0.06422, over 21438.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3039, pruned_loss=0.07183, over 4259125.37 frames. ], batch size: 194, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:58:55,813 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.433e+02 4.160e+02 4.945e+02 7.686e+02 1.227e+03, threshold=9.890e+02, percent-clipped=0.0 2023-06-25 17:59:37,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1387752.0, ans=0.2 2023-06-25 18:00:10,400 INFO [train.py:996] (2/4) Epoch 8, batch 17850, loss[loss=0.2333, simple_loss=0.3084, pruned_loss=0.07913, over 21988.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3037, pruned_loss=0.07225, over 4265656.86 frames. ], batch size: 317, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 18:00:24,238 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.61 vs. limit=22.5 2023-06-25 18:00:47,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1387932.0, ans=0.125 2023-06-25 18:01:20,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1388052.0, ans=0.125 2023-06-25 18:01:49,165 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-06-25 18:01:54,656 INFO [train.py:996] (2/4) Epoch 8, batch 17900, loss[loss=0.2364, simple_loss=0.3074, pruned_loss=0.08268, over 21339.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3089, pruned_loss=0.07382, over 4266472.19 frames. 
], batch size: 549, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 18:02:36,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1388232.0, ans=0.125 2023-06-25 18:02:40,621 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.096e+02 4.760e+02 6.226e+02 9.356e+02 2.163e+03, threshold=1.245e+03, percent-clipped=21.0 2023-06-25 18:02:45,499 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-25 18:03:08,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1388352.0, ans=0.125 2023-06-25 18:03:11,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1388352.0, ans=0.125 2023-06-25 18:03:18,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1388352.0, ans=0.0 2023-06-25 18:03:44,513 INFO [train.py:996] (2/4) Epoch 8, batch 17950, loss[loss=0.1844, simple_loss=0.2749, pruned_loss=0.04698, over 21647.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3095, pruned_loss=0.07088, over 4267860.48 frames. ], batch size: 247, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 18:04:42,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1388592.0, ans=0.0 2023-06-25 18:05:27,066 INFO [train.py:996] (2/4) Epoch 8, batch 18000, loss[loss=0.2076, simple_loss=0.2824, pruned_loss=0.06643, over 21616.00 frames. ], tot_loss[loss=0.221, simple_loss=0.3035, pruned_loss=0.06922, over 4263046.67 frames. ], batch size: 332, lr: 3.71e-03, grad_scale: 32.0 2023-06-25 18:05:27,067 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 18:05:48,125 INFO [train.py:1028] (2/4) Epoch 8, validation: loss=0.2638, simple_loss=0.3571, pruned_loss=0.08527, over 1796401.00 frames. 2023-06-25 18:05:48,126 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-25 18:05:53,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1388772.0, ans=0.125 2023-06-25 18:06:13,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1388832.0, ans=10.0 2023-06-25 18:06:20,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1388832.0, ans=0.2 2023-06-25 18:06:23,290 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.397e+02 3.503e+02 4.294e+02 6.004e+02 1.457e+03, threshold=8.588e+02, percent-clipped=3.0 2023-06-25 18:06:57,731 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-25 18:07:02,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1389012.0, ans=0.125 2023-06-25 18:07:37,520 INFO [train.py:996] (2/4) Epoch 8, batch 18050, loss[loss=0.2263, simple_loss=0.2951, pruned_loss=0.07879, over 21892.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2986, pruned_loss=0.0685, over 4263101.71 frames. 
], batch size: 317, lr: 3.71e-03, grad_scale: 32.0 2023-06-25 18:07:50,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1389072.0, ans=0.125 2023-06-25 18:08:31,321 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.96 vs. limit=15.0 2023-06-25 18:09:33,661 INFO [train.py:996] (2/4) Epoch 8, batch 18100, loss[loss=0.2076, simple_loss=0.2855, pruned_loss=0.06486, over 21196.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2995, pruned_loss=0.0705, over 4266349.69 frames. ], batch size: 176, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 18:10:05,557 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.803e+02 3.773e+02 4.901e+02 6.840e+02 2.108e+03, threshold=9.801e+02, percent-clipped=15.0 2023-06-25 18:10:11,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1389492.0, ans=0.125 2023-06-25 18:10:20,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1389492.0, ans=0.125 2023-06-25 18:10:51,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1389552.0, ans=0.0 2023-06-25 18:11:22,664 INFO [train.py:996] (2/4) Epoch 8, batch 18150, loss[loss=0.2168, simple_loss=0.3002, pruned_loss=0.06668, over 21903.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.3002, pruned_loss=0.07033, over 4253505.50 frames. ], batch size: 373, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 18:11:51,274 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=15.0 2023-06-25 18:12:53,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1389912.0, ans=0.0 2023-06-25 18:13:10,117 INFO [train.py:996] (2/4) Epoch 8, batch 18200, loss[loss=0.1954, simple_loss=0.2671, pruned_loss=0.06187, over 21718.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2953, pruned_loss=0.07026, over 4264910.39 frames. ], batch size: 112, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 18:13:39,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1390032.0, ans=0.0 2023-06-25 18:13:40,440 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.735e+02 4.047e+02 5.658e+02 8.715e+02 2.136e+03, threshold=1.132e+03, percent-clipped=16.0 2023-06-25 18:13:58,263 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=22.5 2023-06-25 18:14:29,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1390212.0, ans=0.1 2023-06-25 18:14:32,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1390212.0, ans=0.125 2023-06-25 18:14:49,889 INFO [train.py:996] (2/4) Epoch 8, batch 18250, loss[loss=0.1881, simple_loss=0.262, pruned_loss=0.05709, over 21819.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.288, pruned_loss=0.06822, over 4274273.05 frames. 
], batch size: 298, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:15:02,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1390272.0, ans=0.125 2023-06-25 18:16:26,098 INFO [train.py:996] (2/4) Epoch 8, batch 18300, loss[loss=0.1776, simple_loss=0.253, pruned_loss=0.05109, over 21872.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2874, pruned_loss=0.06783, over 4263706.35 frames. ], batch size: 98, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:17:12,346 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.529e+02 4.046e+02 5.831e+02 1.006e+03 2.196e+03, threshold=1.166e+03, percent-clipped=19.0 2023-06-25 18:17:21,624 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:17:31,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1390752.0, ans=0.125 2023-06-25 18:18:03,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1390812.0, ans=0.125 2023-06-25 18:18:12,750 INFO [train.py:996] (2/4) Epoch 8, batch 18350, loss[loss=0.2089, simple_loss=0.3122, pruned_loss=0.05286, over 21398.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2947, pruned_loss=0.06818, over 4260628.63 frames. ], batch size: 211, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:18:40,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1390932.0, ans=0.1 2023-06-25 18:18:50,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1390932.0, ans=0.04949747468305833 2023-06-25 18:19:16,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1391052.0, ans=0.0 2023-06-25 18:19:56,472 INFO [train.py:996] (2/4) Epoch 8, batch 18400, loss[loss=0.1716, simple_loss=0.253, pruned_loss=0.04508, over 21556.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2923, pruned_loss=0.06738, over 4259753.63 frames. ], batch size: 195, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:20:26,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1391232.0, ans=0.0 2023-06-25 18:20:30,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1391232.0, ans=0.1 2023-06-25 18:20:38,567 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.600e+02 3.733e+02 5.113e+02 7.460e+02 1.718e+03, threshold=1.023e+03, percent-clipped=6.0 2023-06-25 18:20:46,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1391292.0, ans=0.125 2023-06-25 18:21:28,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1391412.0, ans=0.125 2023-06-25 18:21:45,625 INFO [train.py:996] (2/4) Epoch 8, batch 18450, loss[loss=0.172, simple_loss=0.2587, pruned_loss=0.04266, over 21682.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2888, pruned_loss=0.06457, over 4266429.55 frames. 
], batch size: 247, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:22:12,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1391532.0, ans=0.125 2023-06-25 18:22:29,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1391592.0, ans=0.1 2023-06-25 18:22:34,214 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.81 vs. limit=6.0 2023-06-25 18:22:52,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1391652.0, ans=0.125 2023-06-25 18:23:18,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1391712.0, ans=0.1 2023-06-25 18:23:26,129 INFO [train.py:996] (2/4) Epoch 8, batch 18500, loss[loss=0.1747, simple_loss=0.2453, pruned_loss=0.05204, over 17990.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2832, pruned_loss=0.0635, over 4264065.20 frames. ], batch size: 68, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:23:45,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1391772.0, ans=0.125 2023-06-25 18:23:47,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1391772.0, ans=0.0 2023-06-25 18:23:48,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1391772.0, ans=0.1 2023-06-25 18:23:52,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1391832.0, ans=0.125 2023-06-25 18:24:07,382 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.590e+02 3.343e+02 4.214e+02 5.911e+02 1.246e+03, threshold=8.429e+02, percent-clipped=4.0 2023-06-25 18:24:45,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1391952.0, ans=0.0 2023-06-25 18:25:09,787 INFO [train.py:996] (2/4) Epoch 8, batch 18550, loss[loss=0.2088, simple_loss=0.2739, pruned_loss=0.07188, over 21532.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2805, pruned_loss=0.06278, over 4260366.84 frames. ], batch size: 414, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:25:46,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1392132.0, ans=0.125 2023-06-25 18:27:04,660 INFO [train.py:996] (2/4) Epoch 8, batch 18600, loss[loss=0.2018, simple_loss=0.2585, pruned_loss=0.07251, over 20104.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2795, pruned_loss=0.06338, over 4265458.62 frames. 
], batch size: 703, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:27:15,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1392372.0, ans=0.5 2023-06-25 18:27:20,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1392432.0, ans=0.1 2023-06-25 18:27:36,464 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.831e+02 3.804e+02 5.092e+02 7.468e+02 1.783e+03, threshold=1.018e+03, percent-clipped=18.0 2023-06-25 18:27:47,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1392492.0, ans=0.125 2023-06-25 18:28:10,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1392552.0, ans=0.0 2023-06-25 18:28:24,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1392612.0, ans=0.2 2023-06-25 18:28:33,924 INFO [train.py:996] (2/4) Epoch 8, batch 18650, loss[loss=0.2051, simple_loss=0.2757, pruned_loss=0.06726, over 21814.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2785, pruned_loss=0.06377, over 4258061.26 frames. ], batch size: 352, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:28:54,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1392672.0, ans=0.125 2023-06-25 18:29:24,796 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-25 18:29:49,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1392852.0, ans=0.0 2023-06-25 18:30:16,041 INFO [train.py:996] (2/4) Epoch 8, batch 18700, loss[loss=0.1997, simple_loss=0.2755, pruned_loss=0.06199, over 21739.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2762, pruned_loss=0.06498, over 4265727.20 frames. ], batch size: 112, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:30:17,244 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=22.5 2023-06-25 18:30:38,265 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=12.0 2023-06-25 18:30:54,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1393032.0, ans=0.0 2023-06-25 18:31:04,139 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.953e+02 3.708e+02 4.986e+02 6.996e+02 1.849e+03, threshold=9.973e+02, percent-clipped=6.0 2023-06-25 18:31:29,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1393152.0, ans=0.125 2023-06-25 18:31:41,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1393152.0, ans=0.0 2023-06-25 18:31:54,540 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. 
limit=6.0 2023-06-25 18:31:55,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1393212.0, ans=0.0 2023-06-25 18:32:03,267 INFO [train.py:996] (2/4) Epoch 8, batch 18750, loss[loss=0.2159, simple_loss=0.2981, pruned_loss=0.06691, over 21634.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2795, pruned_loss=0.06688, over 4258924.03 frames. ], batch size: 230, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:32:23,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1393272.0, ans=0.0 2023-06-25 18:32:53,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1393392.0, ans=0.125 2023-06-25 18:33:09,624 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.88 vs. limit=10.0 2023-06-25 18:33:43,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1393512.0, ans=0.2 2023-06-25 18:33:47,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1393572.0, ans=0.125 2023-06-25 18:33:48,494 INFO [train.py:996] (2/4) Epoch 8, batch 18800, loss[loss=0.1578, simple_loss=0.2201, pruned_loss=0.04777, over 16246.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2844, pruned_loss=0.0676, over 4244500.70 frames. ], batch size: 60, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:33:49,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1393572.0, ans=0.125 2023-06-25 18:34:18,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1393632.0, ans=0.0 2023-06-25 18:34:23,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1393632.0, ans=0.125 2023-06-25 18:34:31,896 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.846e+02 4.247e+02 5.340e+02 7.897e+02 1.499e+03, threshold=1.068e+03, percent-clipped=10.0 2023-06-25 18:34:46,922 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. limit=6.0 2023-06-25 18:35:21,951 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.67 vs. limit=15.0 2023-06-25 18:35:31,318 INFO [train.py:996] (2/4) Epoch 8, batch 18850, loss[loss=0.2026, simple_loss=0.2729, pruned_loss=0.06618, over 21498.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2795, pruned_loss=0.06363, over 4240145.48 frames. 
], batch size: 442, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:35:32,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1393872.0, ans=0.125 2023-06-25 18:36:46,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1394052.0, ans=0.125 2023-06-25 18:36:51,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1394052.0, ans=0.2 2023-06-25 18:37:00,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1394112.0, ans=0.0 2023-06-25 18:37:03,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1394112.0, ans=0.0 2023-06-25 18:37:15,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1394112.0, ans=0.035 2023-06-25 18:37:18,072 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=22.5 2023-06-25 18:37:18,485 INFO [train.py:996] (2/4) Epoch 8, batch 18900, loss[loss=0.1832, simple_loss=0.2548, pruned_loss=0.05583, over 21845.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2768, pruned_loss=0.06434, over 4254841.03 frames. ], batch size: 373, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:37:35,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1394172.0, ans=0.125 2023-06-25 18:38:09,195 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 3.589e+02 4.833e+02 6.205e+02 1.384e+03, threshold=9.667e+02, percent-clipped=4.0 2023-06-25 18:39:07,655 INFO [train.py:996] (2/4) Epoch 8, batch 18950, loss[loss=0.2275, simple_loss=0.3229, pruned_loss=0.06607, over 21733.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2777, pruned_loss=0.06611, over 4264630.84 frames. ], batch size: 298, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:39:17,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1394472.0, ans=0.125 2023-06-25 18:39:43,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1394532.0, ans=0.125 2023-06-25 18:40:04,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1394592.0, ans=0.125 2023-06-25 18:40:04,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1394592.0, ans=0.2 2023-06-25 18:41:08,068 INFO [train.py:996] (2/4) Epoch 8, batch 19000, loss[loss=0.2464, simple_loss=0.3235, pruned_loss=0.08466, over 21750.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2884, pruned_loss=0.068, over 4270801.55 frames. ], batch size: 332, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:41:17,827 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:41:34,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1394832.0, ans=0.125 2023-06-25 18:41:40,563 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.88 vs. 
limit=15.0 2023-06-25 18:41:47,837 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.853e+02 4.722e+02 6.033e+02 9.741e+02 2.203e+03, threshold=1.207e+03, percent-clipped=24.0 2023-06-25 18:41:48,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1394892.0, ans=0.0 2023-06-25 18:41:57,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1394892.0, ans=0.0 2023-06-25 18:42:02,404 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:42:17,216 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:42:56,841 INFO [train.py:996] (2/4) Epoch 8, batch 19050, loss[loss=0.211, simple_loss=0.2757, pruned_loss=0.07318, over 21348.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2932, pruned_loss=0.07102, over 4275228.74 frames. ], batch size: 176, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:43:16,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1395132.0, ans=0.125 2023-06-25 18:43:21,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1395132.0, ans=0.125 2023-06-25 18:43:40,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1395192.0, ans=0.1 2023-06-25 18:43:50,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1395192.0, ans=0.125 2023-06-25 18:44:10,462 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.09 vs. limit=15.0 2023-06-25 18:44:10,636 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2023-06-25 18:44:14,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1395312.0, ans=0.1 2023-06-25 18:44:44,107 INFO [train.py:996] (2/4) Epoch 8, batch 19100, loss[loss=0.1973, simple_loss=0.2652, pruned_loss=0.06469, over 21780.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2923, pruned_loss=0.07203, over 4272327.09 frames. 
], batch size: 371, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:44:48,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1395372.0, ans=0.5 2023-06-25 18:45:08,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1395432.0, ans=0.2 2023-06-25 18:45:19,983 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.078e+02 4.021e+02 4.752e+02 6.454e+02 2.086e+03, threshold=9.504e+02, percent-clipped=4.0 2023-06-25 18:45:32,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1395492.0, ans=0.125 2023-06-25 18:46:20,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1395612.0, ans=0.125 2023-06-25 18:46:20,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1395612.0, ans=0.125 2023-06-25 18:46:30,730 INFO [train.py:996] (2/4) Epoch 8, batch 19150, loss[loss=0.2254, simple_loss=0.3126, pruned_loss=0.06911, over 20871.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2933, pruned_loss=0.07258, over 4264222.51 frames. ], batch size: 608, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:46:39,839 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.62 vs. limit=10.0 2023-06-25 18:47:13,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1395792.0, ans=0.125 2023-06-25 18:47:31,083 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:48:05,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1395912.0, ans=0.2 2023-06-25 18:48:09,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1395912.0, ans=0.05 2023-06-25 18:48:19,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1395972.0, ans=0.04949747468305833 2023-06-25 18:48:21,122 INFO [train.py:996] (2/4) Epoch 8, batch 19200, loss[loss=0.279, simple_loss=0.3867, pruned_loss=0.08563, over 21224.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3056, pruned_loss=0.07431, over 4269848.60 frames. ], batch size: 549, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:49:00,520 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.955e+02 4.242e+02 5.606e+02 9.141e+02 1.658e+03, threshold=1.121e+03, percent-clipped=22.0 2023-06-25 18:49:04,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1396092.0, ans=0.125 2023-06-25 18:49:06,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1396092.0, ans=0.1 2023-06-25 18:49:12,228 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-25 18:50:01,409 INFO [train.py:996] (2/4) Epoch 8, batch 19250, loss[loss=0.1948, simple_loss=0.286, pruned_loss=0.05178, over 21418.00 frames. 
], tot_loss[loss=0.2221, simple_loss=0.3049, pruned_loss=0.06966, over 4266182.59 frames. ], batch size: 548, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:50:15,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1396272.0, ans=0.0 2023-06-25 18:50:36,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1396332.0, ans=0.125 2023-06-25 18:50:51,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1396392.0, ans=0.125 2023-06-25 18:50:56,947 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:50:58,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1396452.0, ans=0.125 2023-06-25 18:51:44,987 INFO [train.py:996] (2/4) Epoch 8, batch 19300, loss[loss=0.2379, simple_loss=0.3041, pruned_loss=0.08585, over 21627.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3027, pruned_loss=0.06917, over 4269242.47 frames. ], batch size: 195, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:52:28,397 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.307e+02 3.708e+02 5.763e+02 8.363e+02 1.771e+03, threshold=1.153e+03, percent-clipped=11.0 2023-06-25 18:53:37,208 INFO [train.py:996] (2/4) Epoch 8, batch 19350, loss[loss=0.2029, simple_loss=0.2916, pruned_loss=0.05713, over 21851.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2966, pruned_loss=0.06544, over 4272452.06 frames. ], batch size: 373, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:53:38,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1396872.0, ans=0.125 2023-06-25 18:54:47,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1397052.0, ans=0.0 2023-06-25 18:55:22,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1397112.0, ans=0.0 2023-06-25 18:55:25,392 INFO [train.py:996] (2/4) Epoch 8, batch 19400, loss[loss=0.1966, simple_loss=0.2758, pruned_loss=0.05876, over 21803.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.294, pruned_loss=0.06464, over 4268412.12 frames. ], batch size: 298, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:55:54,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1397232.0, ans=0.07 2023-06-25 18:55:56,195 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.10 vs. 
limit=15.0 2023-06-25 18:55:57,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1397232.0, ans=0.1 2023-06-25 18:55:59,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1397232.0, ans=0.0 2023-06-25 18:56:07,071 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.659e+02 3.804e+02 4.878e+02 6.968e+02 1.951e+03, threshold=9.756e+02, percent-clipped=7.0 2023-06-25 18:56:25,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1397292.0, ans=0.0 2023-06-25 18:56:40,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1397352.0, ans=0.125 2023-06-25 18:56:51,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1397352.0, ans=0.0 2023-06-25 18:56:52,020 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=15.0 2023-06-25 18:57:01,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1397412.0, ans=0.1 2023-06-25 18:57:13,665 INFO [train.py:996] (2/4) Epoch 8, batch 19450, loss[loss=0.1881, simple_loss=0.2617, pruned_loss=0.05728, over 21733.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2917, pruned_loss=0.06644, over 4278934.66 frames. ], batch size: 298, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:57:39,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1397532.0, ans=0.125 2023-06-25 18:57:50,485 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=12.0 2023-06-25 18:57:51,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1397532.0, ans=0.0 2023-06-25 18:57:53,472 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:58:26,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1397652.0, ans=0.0 2023-06-25 18:58:45,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1397712.0, ans=0.025 2023-06-25 18:59:01,910 INFO [train.py:996] (2/4) Epoch 8, batch 19500, loss[loss=0.2156, simple_loss=0.296, pruned_loss=0.0676, over 21580.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2875, pruned_loss=0.06727, over 4282163.10 frames. 
], batch size: 389, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 18:59:47,841 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.224e+02 4.180e+02 5.667e+02 7.986e+02 1.317e+03, threshold=1.133e+03, percent-clipped=13.0 2023-06-25 18:59:50,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1397892.0, ans=0.2 2023-06-25 18:59:55,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1397892.0, ans=0.04949747468305833 2023-06-25 18:59:57,375 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=22.5 2023-06-25 19:00:26,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1398012.0, ans=0.125 2023-06-25 19:00:42,904 INFO [train.py:996] (2/4) Epoch 8, batch 19550, loss[loss=0.1962, simple_loss=0.2867, pruned_loss=0.05282, over 21845.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2825, pruned_loss=0.06583, over 4275808.53 frames. ], batch size: 371, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:01:06,575 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.87 vs. limit=10.0 2023-06-25 19:01:15,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1398132.0, ans=0.0 2023-06-25 19:01:52,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1398252.0, ans=0.0 2023-06-25 19:02:08,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1398312.0, ans=0.05 2023-06-25 19:02:23,101 INFO [train.py:996] (2/4) Epoch 8, batch 19600, loss[loss=0.2443, simple_loss=0.3088, pruned_loss=0.08991, over 21319.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2845, pruned_loss=0.06687, over 4286477.74 frames. ], batch size: 159, lr: 3.69e-03, grad_scale: 32.0 2023-06-25 19:02:28,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1398372.0, ans=0.2 2023-06-25 19:02:53,501 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=15.0 2023-06-25 19:02:53,536 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.63 vs. limit=6.0 2023-06-25 19:02:57,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1398432.0, ans=0.125 2023-06-25 19:03:02,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1398492.0, ans=0.2 2023-06-25 19:03:10,846 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.701e+02 4.318e+02 6.046e+02 9.838e+02 1.787e+03, threshold=1.209e+03, percent-clipped=19.0 2023-06-25 19:04:06,671 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.09 vs. limit=10.0 2023-06-25 19:04:10,728 INFO [train.py:996] (2/4) Epoch 8, batch 19650, loss[loss=0.2235, simple_loss=0.3255, pruned_loss=0.06075, over 19879.00 frames. 
], tot_loss[loss=0.2156, simple_loss=0.2899, pruned_loss=0.07068, over 4286734.58 frames. ], batch size: 702, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:04:25,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1398672.0, ans=0.0 2023-06-25 19:04:49,198 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-25 19:05:41,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1398852.0, ans=0.125 2023-06-25 19:06:06,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1398972.0, ans=0.125 2023-06-25 19:06:07,327 INFO [train.py:996] (2/4) Epoch 8, batch 19700, loss[loss=0.2104, simple_loss=0.3043, pruned_loss=0.05827, over 21725.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2918, pruned_loss=0.07103, over 4281221.80 frames. ], batch size: 352, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:06:38,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1399032.0, ans=0.125 2023-06-25 19:06:51,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1399032.0, ans=0.5 2023-06-25 19:06:57,334 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.73 vs. limit=15.0 2023-06-25 19:07:03,160 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.726e+02 4.245e+02 5.228e+02 6.853e+02 1.147e+03, threshold=1.046e+03, percent-clipped=0.0 2023-06-25 19:07:14,963 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-25 19:07:26,884 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-25 19:08:00,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1399272.0, ans=0.1 2023-06-25 19:08:01,681 INFO [train.py:996] (2/4) Epoch 8, batch 19750, loss[loss=0.222, simple_loss=0.3154, pruned_loss=0.0643, over 21595.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3011, pruned_loss=0.07244, over 4278693.71 frames. ], batch size: 230, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:08:54,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1399392.0, ans=0.0 2023-06-25 19:09:13,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1399452.0, ans=0.1 2023-06-25 19:09:40,185 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. limit=6.0 2023-06-25 19:09:43,300 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.81 vs. 
limit=22.5 2023-06-25 19:09:54,468 INFO [train.py:996] (2/4) Epoch 8, batch 19800, loss[loss=0.2349, simple_loss=0.3168, pruned_loss=0.07652, over 21512.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3015, pruned_loss=0.07276, over 4277222.25 frames. ], batch size: 471, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:09:59,357 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=15.0 2023-06-25 19:10:00,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1399572.0, ans=0.5 2023-06-25 19:10:07,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1399572.0, ans=0.1 2023-06-25 19:10:14,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1399632.0, ans=0.125 2023-06-25 19:10:38,209 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.249e+02 4.512e+02 5.932e+02 8.767e+02 2.271e+03, threshold=1.186e+03, percent-clipped=19.0 2023-06-25 19:10:44,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1399692.0, ans=0.125 2023-06-25 19:11:42,800 INFO [train.py:996] (2/4) Epoch 8, batch 19850, loss[loss=0.1999, simple_loss=0.2968, pruned_loss=0.05157, over 21759.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2943, pruned_loss=0.06812, over 4275664.78 frames. ], batch size: 332, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:11:55,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1399872.0, ans=0.0 2023-06-25 19:12:04,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1399932.0, ans=10.0 2023-06-25 19:12:11,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1399932.0, ans=0.0 2023-06-25 19:12:42,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1399992.0, ans=0.0 2023-06-25 19:13:04,703 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=22.5 2023-06-25 19:13:28,595 INFO [train.py:996] (2/4) Epoch 8, batch 19900, loss[loss=0.1979, simple_loss=0.2766, pruned_loss=0.05961, over 21720.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2942, pruned_loss=0.06564, over 4276560.27 frames. ], batch size: 316, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:13:57,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1400232.0, ans=0.2 2023-06-25 19:14:17,383 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.623e+02 3.554e+02 4.496e+02 7.903e+02 1.499e+03, threshold=8.992e+02, percent-clipped=4.0 2023-06-25 19:14:28,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1400292.0, ans=0.125 2023-06-25 19:15:17,770 INFO [train.py:996] (2/4) Epoch 8, batch 19950, loss[loss=0.1628, simple_loss=0.2347, pruned_loss=0.04544, over 21338.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2883, pruned_loss=0.06544, over 4275038.74 frames. 
], batch size: 131, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:15:39,122 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.40 vs. limit=15.0 2023-06-25 19:15:46,252 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=15.0 2023-06-25 19:15:49,235 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.15 vs. limit=10.0 2023-06-25 19:15:53,902 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:16:09,501 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:16:55,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1400712.0, ans=0.125 2023-06-25 19:16:56,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1400712.0, ans=0.1 2023-06-25 19:16:58,866 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.21 vs. limit=15.0 2023-06-25 19:17:12,744 INFO [train.py:996] (2/4) Epoch 8, batch 20000, loss[loss=0.2158, simple_loss=0.2894, pruned_loss=0.07106, over 21847.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2888, pruned_loss=0.06638, over 4262885.62 frames. ], batch size: 332, lr: 3.69e-03, grad_scale: 32.0 2023-06-25 19:17:55,686 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.991e+02 3.942e+02 5.343e+02 7.186e+02 1.508e+03, threshold=1.069e+03, percent-clipped=12.0 2023-06-25 19:18:18,803 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=12.0 2023-06-25 19:18:33,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1400952.0, ans=0.0 2023-06-25 19:18:34,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1401012.0, ans=0.1 2023-06-25 19:18:40,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1401012.0, ans=0.125 2023-06-25 19:18:40,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1401012.0, ans=0.125 2023-06-25 19:18:41,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1401012.0, ans=0.0 2023-06-25 19:18:47,485 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=12.0 2023-06-25 19:18:58,685 INFO [train.py:996] (2/4) Epoch 8, batch 20050, loss[loss=0.2268, simple_loss=0.2977, pruned_loss=0.07801, over 21298.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2911, pruned_loss=0.06864, over 4269985.41 frames. 
], batch size: 143, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:19:13,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1401072.0, ans=0.125 2023-06-25 19:19:30,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1401132.0, ans=0.05 2023-06-25 19:20:48,085 INFO [train.py:996] (2/4) Epoch 8, batch 20100, loss[loss=0.2173, simple_loss=0.2948, pruned_loss=0.06985, over 21394.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.293, pruned_loss=0.07017, over 4278076.68 frames. ], batch size: 159, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:20:48,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1401372.0, ans=0.125 2023-06-25 19:20:55,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1401372.0, ans=0.125 2023-06-25 19:21:04,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1401432.0, ans=15.0 2023-06-25 19:21:09,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1401432.0, ans=0.1 2023-06-25 19:21:32,548 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:21:33,510 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.930e+02 3.809e+02 4.961e+02 6.304e+02 1.570e+03, threshold=9.921e+02, percent-clipped=3.0 2023-06-25 19:21:53,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1401492.0, ans=0.2 2023-06-25 19:22:34,574 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.94 vs. limit=15.0 2023-06-25 19:22:38,480 INFO [train.py:996] (2/4) Epoch 8, batch 20150, loss[loss=0.2725, simple_loss=0.3504, pruned_loss=0.0973, over 21541.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3043, pruned_loss=0.074, over 4280580.23 frames. ], batch size: 414, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:22:47,415 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.94 vs. limit=10.0 2023-06-25 19:23:02,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1401732.0, ans=0.0 2023-06-25 19:23:34,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1401792.0, ans=0.1 2023-06-25 19:23:41,059 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.71 vs. limit=15.0 2023-06-25 19:23:42,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1401792.0, ans=0.0 2023-06-25 19:24:35,467 INFO [train.py:996] (2/4) Epoch 8, batch 20200, loss[loss=0.2045, simple_loss=0.2918, pruned_loss=0.05859, over 21277.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3096, pruned_loss=0.07727, over 4276612.09 frames. 
], batch size: 176, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:24:40,247 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.02 vs. limit=6.0 2023-06-25 19:25:16,660 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.08 vs. limit=10.0 2023-06-25 19:25:24,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1402092.0, ans=0.1 2023-06-25 19:25:25,539 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.344e+02 4.250e+02 5.853e+02 8.923e+02 1.822e+03, threshold=1.171e+03, percent-clipped=17.0 2023-06-25 19:25:59,982 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-25 19:26:17,420 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-25 19:26:18,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1402212.0, ans=0.1 2023-06-25 19:26:23,010 INFO [train.py:996] (2/4) Epoch 8, batch 20250, loss[loss=0.204, simple_loss=0.2821, pruned_loss=0.0629, over 21665.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3109, pruned_loss=0.07576, over 4271768.54 frames. ], batch size: 230, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:26:36,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1402272.0, ans=0.2 2023-06-25 19:26:48,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1402272.0, ans=0.1 2023-06-25 19:26:58,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1402332.0, ans=0.125 2023-06-25 19:27:30,587 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.76 vs. limit=15.0 2023-06-25 19:27:35,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1402452.0, ans=0.125 2023-06-25 19:27:59,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1402512.0, ans=0.2 2023-06-25 19:28:01,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1402512.0, ans=0.1 2023-06-25 19:28:15,779 INFO [train.py:996] (2/4) Epoch 8, batch 20300, loss[loss=0.21, simple_loss=0.2899, pruned_loss=0.06507, over 21263.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3079, pruned_loss=0.07241, over 4269303.80 frames. 
], batch size: 176, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:28:58,613 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.664e+02 3.704e+02 5.083e+02 7.003e+02 2.093e+03, threshold=1.017e+03, percent-clipped=9.0 2023-06-25 19:29:28,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1402752.0, ans=0.2 2023-06-25 19:29:56,324 INFO [train.py:996] (2/4) Epoch 8, batch 20350, loss[loss=0.2374, simple_loss=0.3106, pruned_loss=0.08207, over 21810.00 frames. ], tot_loss[loss=0.228, simple_loss=0.309, pruned_loss=0.07345, over 4268416.06 frames. ], batch size: 124, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:31:17,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1403052.0, ans=0.125 2023-06-25 19:31:44,536 INFO [train.py:996] (2/4) Epoch 8, batch 20400, loss[loss=0.2374, simple_loss=0.3187, pruned_loss=0.07809, over 21788.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3108, pruned_loss=0.07548, over 4260264.73 frames. ], batch size: 332, lr: 3.69e-03, grad_scale: 32.0 2023-06-25 19:32:14,513 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:32:32,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=1403292.0, ans=0.1 2023-06-25 19:32:33,896 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.980e+02 4.155e+02 6.028e+02 7.732e+02 1.561e+03, threshold=1.206e+03, percent-clipped=8.0 2023-06-25 19:32:42,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1403292.0, ans=0.125 2023-06-25 19:33:31,094 INFO [train.py:996] (2/4) Epoch 8, batch 20450, loss[loss=0.2298, simple_loss=0.3003, pruned_loss=0.07964, over 21510.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.312, pruned_loss=0.07776, over 4261351.51 frames. ], batch size: 194, lr: 3.69e-03, grad_scale: 32.0 2023-06-25 19:33:31,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1403472.0, ans=0.1 2023-06-25 19:33:53,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1403472.0, ans=0.125 2023-06-25 19:34:17,247 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.82 vs. limit=6.0 2023-06-25 19:34:44,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1403652.0, ans=0.2 2023-06-25 19:35:16,946 INFO [train.py:996] (2/4) Epoch 8, batch 20500, loss[loss=0.191, simple_loss=0.2644, pruned_loss=0.05881, over 21691.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3061, pruned_loss=0.07734, over 4259372.12 frames. 
], batch size: 247, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:35:19,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1403772.0, ans=0.125 2023-06-25 19:35:29,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1403772.0, ans=0.125 2023-06-25 19:36:07,744 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.844e+02 4.072e+02 6.125e+02 8.287e+02 1.348e+03, threshold=1.225e+03, percent-clipped=6.0 2023-06-25 19:36:10,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1403892.0, ans=0.125 2023-06-25 19:36:27,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1403952.0, ans=0.1 2023-06-25 19:36:35,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1403952.0, ans=0.125 2023-06-25 19:37:09,436 INFO [train.py:996] (2/4) Epoch 8, batch 20550, loss[loss=0.2273, simple_loss=0.3136, pruned_loss=0.0705, over 21573.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.298, pruned_loss=0.07546, over 4249656.89 frames. ], batch size: 389, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:37:17,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1404072.0, ans=0.0 2023-06-25 19:37:28,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1404072.0, ans=0.125 2023-06-25 19:37:30,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1404132.0, ans=0.125 2023-06-25 19:38:56,994 INFO [train.py:996] (2/4) Epoch 8, batch 20600, loss[loss=0.2146, simple_loss=0.2824, pruned_loss=0.07343, over 21031.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2999, pruned_loss=0.07271, over 4247190.45 frames. ], batch size: 607, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:38:59,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1404372.0, ans=0.0 2023-06-25 19:39:19,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1404432.0, ans=0.1 2023-06-25 19:39:34,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1404492.0, ans=0.0 2023-06-25 19:39:42,070 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.887e+02 4.920e+02 7.013e+02 1.215e+03 1.791e+03, threshold=1.403e+03, percent-clipped=24.0 2023-06-25 19:40:42,125 INFO [train.py:996] (2/4) Epoch 8, batch 20650, loss[loss=0.1852, simple_loss=0.252, pruned_loss=0.05918, over 21593.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2967, pruned_loss=0.07331, over 4256143.68 frames. ], batch size: 231, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:40:45,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1404672.0, ans=0.125 2023-06-25 19:41:33,763 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.24 vs. 
limit=15.0 2023-06-25 19:41:35,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1404792.0, ans=0.2 2023-06-25 19:41:53,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1404852.0, ans=0.1 2023-06-25 19:42:31,282 INFO [train.py:996] (2/4) Epoch 8, batch 20700, loss[loss=0.2433, simple_loss=0.323, pruned_loss=0.08174, over 21590.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2891, pruned_loss=0.07014, over 4260001.68 frames. ], batch size: 441, lr: 3.69e-03, grad_scale: 8.0 2023-06-25 19:42:32,538 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.21 vs. limit=15.0 2023-06-25 19:42:54,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1405032.0, ans=0.125 2023-06-25 19:43:09,591 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:43:27,035 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.481e+02 3.647e+02 4.600e+02 6.617e+02 1.302e+03, threshold=9.199e+02, percent-clipped=0.0 2023-06-25 19:44:23,977 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.12 vs. limit=22.5 2023-06-25 19:44:27,726 INFO [train.py:996] (2/4) Epoch 8, batch 20750, loss[loss=0.3203, simple_loss=0.409, pruned_loss=0.1157, over 21511.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2919, pruned_loss=0.06962, over 4259316.64 frames. ], batch size: 471, lr: 3.69e-03, grad_scale: 8.0 2023-06-25 19:44:37,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1405272.0, ans=10.0 2023-06-25 19:45:02,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1405332.0, ans=0.07 2023-06-25 19:45:31,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1405452.0, ans=0.125 2023-06-25 19:46:00,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1405512.0, ans=0.0 2023-06-25 19:46:01,014 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:46:16,253 INFO [train.py:996] (2/4) Epoch 8, batch 20800, loss[loss=0.2101, simple_loss=0.2711, pruned_loss=0.07449, over 19944.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2953, pruned_loss=0.07026, over 4263782.74 frames. 
], batch size: 703, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:46:23,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1405572.0, ans=0.125 2023-06-25 19:46:44,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1405632.0, ans=0.125 2023-06-25 19:47:10,350 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.060e+02 4.318e+02 7.506e+02 1.059e+03 2.434e+03, threshold=1.501e+03, percent-clipped=34.0 2023-06-25 19:47:41,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1405812.0, ans=0.0 2023-06-25 19:47:50,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1405812.0, ans=0.125 2023-06-25 19:48:02,819 INFO [train.py:996] (2/4) Epoch 8, batch 20850, loss[loss=0.1956, simple_loss=0.2733, pruned_loss=0.05892, over 21776.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2887, pruned_loss=0.06842, over 4261681.52 frames. ], batch size: 414, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:48:33,095 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=15.0 2023-06-25 19:48:34,782 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=12.0 2023-06-25 19:49:47,591 INFO [train.py:996] (2/4) Epoch 8, batch 20900, loss[loss=0.2192, simple_loss=0.3052, pruned_loss=0.06664, over 21931.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.29, pruned_loss=0.06965, over 4271184.59 frames. ], batch size: 373, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:50:34,299 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.696e+02 3.719e+02 4.943e+02 7.397e+02 1.417e+03, threshold=9.886e+02, percent-clipped=0.0 2023-06-25 19:50:40,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1406292.0, ans=0.1 2023-06-25 19:51:11,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1406412.0, ans=0.1 2023-06-25 19:51:30,990 INFO [train.py:996] (2/4) Epoch 8, batch 20950, loss[loss=0.1749, simple_loss=0.2553, pruned_loss=0.04727, over 21904.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2858, pruned_loss=0.06635, over 4264012.27 frames. ], batch size: 98, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:51:45,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1406472.0, ans=0.125 2023-06-25 19:51:48,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1406532.0, ans=0.2 2023-06-25 19:52:42,942 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.81 vs. limit=15.0 2023-06-25 19:52:49,355 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.69 vs. limit=15.0 2023-06-25 19:53:11,652 INFO [train.py:996] (2/4) Epoch 8, batch 21000, loss[loss=0.2116, simple_loss=0.2889, pruned_loss=0.06719, over 21222.00 frames. 
], tot_loss[loss=0.2101, simple_loss=0.2854, pruned_loss=0.06744, over 4267781.07 frames. ], batch size: 176, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:53:11,653 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 19:53:31,262 INFO [train.py:1028] (2/4) Epoch 8, validation: loss=0.2635, simple_loss=0.3595, pruned_loss=0.08373, over 1796401.00 frames. 2023-06-25 19:53:31,263 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-25 19:54:05,559 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.06 vs. limit=15.0 2023-06-25 19:54:24,849 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.517e+02 3.578e+02 4.486e+02 7.087e+02 1.717e+03, threshold=8.972e+02, percent-clipped=7.0 2023-06-25 19:54:30,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1406892.0, ans=0.125 2023-06-25 19:54:33,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=1406952.0, ans=0.02 2023-06-25 19:55:17,247 INFO [train.py:996] (2/4) Epoch 8, batch 21050, loss[loss=0.2014, simple_loss=0.2689, pruned_loss=0.06695, over 21271.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2833, pruned_loss=0.06725, over 4269683.69 frames. ], batch size: 159, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:56:00,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1407192.0, ans=0.125 2023-06-25 19:56:08,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1407192.0, ans=0.05 2023-06-25 19:56:18,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1407252.0, ans=0.125 2023-06-25 19:56:40,442 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.49 vs. limit=15.0 2023-06-25 19:57:02,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1407312.0, ans=0.0 2023-06-25 19:57:02,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1407312.0, ans=0.04949747468305833 2023-06-25 19:57:05,065 INFO [train.py:996] (2/4) Epoch 8, batch 21100, loss[loss=0.2021, simple_loss=0.2691, pruned_loss=0.06755, over 21525.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2782, pruned_loss=0.06632, over 4254810.57 frames. ], batch size: 391, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:57:45,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1407492.0, ans=0.125 2023-06-25 19:57:46,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1407492.0, ans=0.04949747468305833 2023-06-25 19:57:57,770 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.898e+02 4.201e+02 5.635e+02 7.939e+02 1.482e+03, threshold=1.127e+03, percent-clipped=15.0 2023-06-25 19:58:49,900 INFO [train.py:996] (2/4) Epoch 8, batch 21150, loss[loss=0.2015, simple_loss=0.263, pruned_loss=0.06998, over 21746.00 frames. 
], tot_loss[loss=0.2036, simple_loss=0.2748, pruned_loss=0.06626, over 4258118.04 frames. ], batch size: 300, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:59:01,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1407672.0, ans=0.2 2023-06-25 19:59:05,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1407732.0, ans=0.2 2023-06-25 19:59:42,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1407792.0, ans=0.0 2023-06-25 20:00:25,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1407912.0, ans=0.04949747468305833 2023-06-25 20:00:38,494 INFO [train.py:996] (2/4) Epoch 8, batch 21200, loss[loss=0.2184, simple_loss=0.2902, pruned_loss=0.07328, over 21967.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2718, pruned_loss=0.0656, over 4258183.03 frames. ], batch size: 103, lr: 3.68e-03, grad_scale: 32.0 2023-06-25 20:00:39,732 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.23 vs. limit=15.0 2023-06-25 20:00:45,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1407972.0, ans=0.125 2023-06-25 20:01:34,474 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.940e+02 3.823e+02 4.703e+02 6.840e+02 1.518e+03, threshold=9.406e+02, percent-clipped=1.0 2023-06-25 20:01:52,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1408152.0, ans=0.125 2023-06-25 20:02:15,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1408212.0, ans=0.1 2023-06-25 20:02:25,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1408272.0, ans=0.0 2023-06-25 20:02:26,021 INFO [train.py:996] (2/4) Epoch 8, batch 21250, loss[loss=0.1979, simple_loss=0.2623, pruned_loss=0.06672, over 21555.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2702, pruned_loss=0.06571, over 4258541.55 frames. ], batch size: 391, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:02:36,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1408272.0, ans=0.0 2023-06-25 20:04:05,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1408512.0, ans=0.025 2023-06-25 20:04:11,920 INFO [train.py:996] (2/4) Epoch 8, batch 21300, loss[loss=0.2021, simple_loss=0.2825, pruned_loss=0.06088, over 21490.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2779, pruned_loss=0.06803, over 4262238.25 frames. 
], batch size: 212, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:04:51,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1408632.0, ans=0.125 2023-06-25 20:05:07,872 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.197e+02 4.370e+02 6.934e+02 9.057e+02 1.727e+03, threshold=1.387e+03, percent-clipped=23.0 2023-06-25 20:05:57,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1408872.0, ans=0.125 2023-06-25 20:05:58,750 INFO [train.py:996] (2/4) Epoch 8, batch 21350, loss[loss=0.1996, simple_loss=0.2944, pruned_loss=0.0524, over 21766.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2818, pruned_loss=0.06869, over 4270135.17 frames. ], batch size: 332, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:06:02,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1408872.0, ans=0.2 2023-06-25 20:07:10,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1409052.0, ans=0.1 2023-06-25 20:07:37,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1409112.0, ans=0.125 2023-06-25 20:07:45,956 INFO [train.py:996] (2/4) Epoch 8, batch 21400, loss[loss=0.2855, simple_loss=0.358, pruned_loss=0.1065, over 21379.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2856, pruned_loss=0.06922, over 4270787.73 frames. ], batch size: 471, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:08:46,534 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.047e+02 3.806e+02 5.030e+02 6.995e+02 1.894e+03, threshold=1.006e+03, percent-clipped=5.0 2023-06-25 20:08:50,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1409292.0, ans=0.125 2023-06-25 20:09:28,486 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-25 20:09:31,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1409472.0, ans=0.025 2023-06-25 20:09:32,594 INFO [train.py:996] (2/4) Epoch 8, batch 21450, loss[loss=0.2553, simple_loss=0.3116, pruned_loss=0.09956, over 21715.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2887, pruned_loss=0.06965, over 4270787.98 frames. ], batch size: 507, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:10:10,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1409532.0, ans=0.1 2023-06-25 20:10:35,672 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.58 vs. limit=15.0 2023-06-25 20:10:37,239 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=22.5 2023-06-25 20:10:48,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1409652.0, ans=0.2 2023-06-25 20:11:20,628 INFO [train.py:996] (2/4) Epoch 8, batch 21500, loss[loss=0.2091, simple_loss=0.2733, pruned_loss=0.07242, over 21284.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2875, pruned_loss=0.07064, over 4267693.33 frames. 
], batch size: 159, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:11:42,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1409772.0, ans=0.125 2023-06-25 20:12:12,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1409892.0, ans=0.125 2023-06-25 20:12:25,072 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.002e+02 3.682e+02 4.429e+02 6.594e+02 1.934e+03, threshold=8.857e+02, percent-clipped=12.0 2023-06-25 20:12:29,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1409892.0, ans=0.1 2023-06-25 20:13:05,245 INFO [train.py:996] (2/4) Epoch 8, batch 21550, loss[loss=0.2121, simple_loss=0.3277, pruned_loss=0.04824, over 19655.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2802, pruned_loss=0.06772, over 4256908.54 frames. ], batch size: 703, lr: 3.68e-03, grad_scale: 8.0 2023-06-25 20:14:12,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1410192.0, ans=0.1 2023-06-25 20:14:36,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1410312.0, ans=0.2 2023-06-25 20:14:53,577 INFO [train.py:996] (2/4) Epoch 8, batch 21600, loss[loss=0.2082, simple_loss=0.3011, pruned_loss=0.05765, over 21579.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2764, pruned_loss=0.06621, over 4249814.85 frames. ], batch size: 441, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:16:02,288 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.883e+02 3.709e+02 4.996e+02 7.825e+02 2.196e+03, threshold=9.991e+02, percent-clipped=18.0 2023-06-25 20:16:16,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1410552.0, ans=0.125 2023-06-25 20:16:46,717 INFO [train.py:996] (2/4) Epoch 8, batch 21650, loss[loss=0.2559, simple_loss=0.3525, pruned_loss=0.07966, over 21652.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.28, pruned_loss=0.06423, over 4247619.51 frames. ], batch size: 441, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:17:24,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1410732.0, ans=0.2 2023-06-25 20:17:30,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1410792.0, ans=0.125 2023-06-25 20:18:09,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1410912.0, ans=0.125 2023-06-25 20:18:18,629 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.67 vs. limit=15.0 2023-06-25 20:18:25,974 INFO [train.py:996] (2/4) Epoch 8, batch 21700, loss[loss=0.1711, simple_loss=0.251, pruned_loss=0.04556, over 21457.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2792, pruned_loss=0.06286, over 4249225.96 frames. 
], batch size: 211, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:18:35,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1410972.0, ans=0.015 2023-06-25 20:19:30,970 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.59 vs. limit=15.0 2023-06-25 20:19:33,117 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.665e+02 3.626e+02 5.313e+02 7.928e+02 1.804e+03, threshold=1.063e+03, percent-clipped=12.0 2023-06-25 20:20:12,869 INFO [train.py:996] (2/4) Epoch 8, batch 21750, loss[loss=0.2007, simple_loss=0.2711, pruned_loss=0.06514, over 21726.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2763, pruned_loss=0.0633, over 4247339.99 frames. ], batch size: 124, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:20:52,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1411332.0, ans=6.0 2023-06-25 20:21:20,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1411392.0, ans=0.0 2023-06-25 20:21:32,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1411452.0, ans=0.125 2023-06-25 20:21:32,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1411452.0, ans=0.1 2023-06-25 20:21:40,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1411452.0, ans=10.0 2023-06-25 20:21:51,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1411512.0, ans=0.125 2023-06-25 20:22:07,468 INFO [train.py:996] (2/4) Epoch 8, batch 21800, loss[loss=0.2025, simple_loss=0.2747, pruned_loss=0.06518, over 21566.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2751, pruned_loss=0.06426, over 4235488.30 frames. ], batch size: 391, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:22:32,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1411632.0, ans=0.0 2023-06-25 20:23:00,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1411692.0, ans=0.0 2023-06-25 20:23:02,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1411692.0, ans=6.0 2023-06-25 20:23:10,285 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.928e+02 3.851e+02 5.673e+02 8.450e+02 2.187e+03, threshold=1.135e+03, percent-clipped=14.0 2023-06-25 20:23:16,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1411752.0, ans=0.125 2023-06-25 20:23:36,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1411812.0, ans=0.5 2023-06-25 20:23:54,667 INFO [train.py:996] (2/4) Epoch 8, batch 21850, loss[loss=0.2024, simple_loss=0.2825, pruned_loss=0.06118, over 21273.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2795, pruned_loss=0.06526, over 4244037.55 frames. 
], batch size: 176, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:24:32,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1411932.0, ans=0.025 2023-06-25 20:24:56,653 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.09 vs. limit=10.0 2023-06-25 20:25:44,261 INFO [train.py:996] (2/4) Epoch 8, batch 21900, loss[loss=0.2472, simple_loss=0.3111, pruned_loss=0.09162, over 21834.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.283, pruned_loss=0.06598, over 4251862.10 frames. ], batch size: 441, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:26:12,959 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.55 vs. limit=15.0 2023-06-25 20:26:36,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1412292.0, ans=0.2 2023-06-25 20:26:43,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1412292.0, ans=0.125 2023-06-25 20:26:46,377 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.095e+02 4.152e+02 5.797e+02 7.520e+02 1.468e+03, threshold=1.159e+03, percent-clipped=2.0 2023-06-25 20:26:48,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1412292.0, ans=0.2 2023-06-25 20:27:36,371 INFO [train.py:996] (2/4) Epoch 8, batch 21950, loss[loss=0.1305, simple_loss=0.2028, pruned_loss=0.02904, over 21203.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.278, pruned_loss=0.06554, over 4251552.29 frames. ], batch size: 176, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:27:40,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1412472.0, ans=0.2 2023-06-25 20:27:49,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1412472.0, ans=0.07 2023-06-25 20:27:51,509 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.13 vs. limit=22.5 2023-06-25 20:27:59,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1412532.0, ans=0.1 2023-06-25 20:29:25,487 INFO [train.py:996] (2/4) Epoch 8, batch 22000, loss[loss=0.2024, simple_loss=0.2693, pruned_loss=0.06779, over 21846.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2725, pruned_loss=0.06329, over 4259224.27 frames. ], batch size: 372, lr: 3.68e-03, grad_scale: 32.0 2023-06-25 20:29:33,211 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.15 vs. 
limit=10.0 2023-06-25 20:29:34,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1412772.0, ans=0.125 2023-06-25 20:30:20,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1412892.0, ans=0.125 2023-06-25 20:30:23,057 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 3.905e+02 5.232e+02 7.810e+02 2.335e+03, threshold=1.046e+03, percent-clipped=14.0 2023-06-25 20:31:13,840 INFO [train.py:996] (2/4) Epoch 8, batch 22050, loss[loss=0.275, simple_loss=0.3513, pruned_loss=0.09936, over 21278.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2762, pruned_loss=0.06376, over 4257502.64 frames. ], batch size: 549, lr: 3.67e-03, grad_scale: 32.0 2023-06-25 20:31:16,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1413072.0, ans=0.0 2023-06-25 20:32:30,817 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.99 vs. limit=15.0 2023-06-25 20:32:38,108 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-25 20:32:50,045 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=22.5 2023-06-25 20:32:54,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1413312.0, ans=0.0 2023-06-25 20:32:56,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1413312.0, ans=0.1 2023-06-25 20:33:02,606 INFO [train.py:996] (2/4) Epoch 8, batch 22100, loss[loss=0.2841, simple_loss=0.3495, pruned_loss=0.1094, over 21791.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2852, pruned_loss=0.06815, over 4255507.58 frames. ], batch size: 441, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:33:33,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1413432.0, ans=0.125 2023-06-25 20:33:34,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1413432.0, ans=0.125 2023-06-25 20:33:43,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1413432.0, ans=0.95 2023-06-25 20:34:00,125 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.256e+02 4.552e+02 6.727e+02 1.040e+03 2.213e+03, threshold=1.345e+03, percent-clipped=23.0 2023-06-25 20:34:47,960 INFO [train.py:996] (2/4) Epoch 8, batch 22150, loss[loss=0.234, simple_loss=0.305, pruned_loss=0.08148, over 21734.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2895, pruned_loss=0.06993, over 4266200.35 frames. 
], batch size: 389, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:35:01,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1413672.0, ans=0.125 2023-06-25 20:35:41,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1413792.0, ans=0.0 2023-06-25 20:35:47,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1413792.0, ans=0.05 2023-06-25 20:36:35,739 INFO [train.py:996] (2/4) Epoch 8, batch 22200, loss[loss=0.2588, simple_loss=0.347, pruned_loss=0.08526, over 21846.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2931, pruned_loss=0.07125, over 4272763.01 frames. ], batch size: 371, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:36:58,458 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-06-25 20:37:04,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1414032.0, ans=0.07 2023-06-25 20:37:29,748 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.047e+02 4.294e+02 5.583e+02 8.306e+02 1.665e+03, threshold=1.117e+03, percent-clipped=3.0 2023-06-25 20:37:33,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1414152.0, ans=0.2 2023-06-25 20:38:23,335 INFO [train.py:996] (2/4) Epoch 8, batch 22250, loss[loss=0.2118, simple_loss=0.2774, pruned_loss=0.07311, over 20189.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2967, pruned_loss=0.07205, over 4277779.66 frames. ], batch size: 702, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:38:32,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1414272.0, ans=0.125 2023-06-25 20:38:36,420 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.08 vs. limit=12.0 2023-06-25 20:38:41,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1414272.0, ans=0.2 2023-06-25 20:39:42,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1414452.0, ans=0.0 2023-06-25 20:39:49,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1414512.0, ans=0.125 2023-06-25 20:40:04,026 INFO [train.py:996] (2/4) Epoch 8, batch 22300, loss[loss=0.2278, simple_loss=0.313, pruned_loss=0.07129, over 20798.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2981, pruned_loss=0.07345, over 4277379.27 frames. ], batch size: 607, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:40:57,266 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.146e+02 4.093e+02 5.360e+02 7.335e+02 1.399e+03, threshold=1.072e+03, percent-clipped=5.0 2023-06-25 20:41:42,641 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-25 20:41:51,931 INFO [train.py:996] (2/4) Epoch 8, batch 22350, loss[loss=0.2183, simple_loss=0.2868, pruned_loss=0.07492, over 21329.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.296, pruned_loss=0.07341, over 4285934.81 frames. 
], batch size: 143, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:42:02,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1414872.0, ans=15.0 2023-06-25 20:42:03,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1414872.0, ans=0.125 2023-06-25 20:42:19,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1414932.0, ans=0.1 2023-06-25 20:42:19,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1414932.0, ans=0.2 2023-06-25 20:43:38,704 INFO [train.py:996] (2/4) Epoch 8, batch 22400, loss[loss=0.1857, simple_loss=0.2514, pruned_loss=0.05995, over 21552.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2922, pruned_loss=0.06949, over 4288338.69 frames. ], batch size: 230, lr: 3.67e-03, grad_scale: 32.0 2023-06-25 20:43:55,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1415232.0, ans=0.05 2023-06-25 20:43:55,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1415232.0, ans=0.125 2023-06-25 20:44:08,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1415232.0, ans=0.05 2023-06-25 20:44:34,054 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.012e+02 4.038e+02 6.138e+02 7.809e+02 1.292e+03, threshold=1.228e+03, percent-clipped=3.0 2023-06-25 20:45:25,836 INFO [train.py:996] (2/4) Epoch 8, batch 22450, loss[loss=0.2042, simple_loss=0.2726, pruned_loss=0.06792, over 21779.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2874, pruned_loss=0.06936, over 4284730.03 frames. ], batch size: 317, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:45:49,581 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.30 vs. limit=12.0 2023-06-25 20:45:51,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1415532.0, ans=0.125 2023-06-25 20:45:52,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1415532.0, ans=0.0 2023-06-25 20:46:33,528 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. limit=6.0 2023-06-25 20:47:12,145 INFO [train.py:996] (2/4) Epoch 8, batch 22500, loss[loss=0.21, simple_loss=0.3045, pruned_loss=0.05775, over 21378.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2824, pruned_loss=0.06907, over 4286387.49 frames. 
], batch size: 194, lr: 3.67e-03, grad_scale: 8.0 2023-06-25 20:47:28,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1415832.0, ans=0.125 2023-06-25 20:47:44,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1415832.0, ans=0.125 2023-06-25 20:48:14,384 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.749e+02 3.949e+02 4.919e+02 7.887e+02 2.030e+03, threshold=9.838e+02, percent-clipped=13.0 2023-06-25 20:48:17,500 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.73 vs. limit=15.0 2023-06-25 20:48:56,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1416012.0, ans=0.0 2023-06-25 20:49:01,265 INFO [train.py:996] (2/4) Epoch 8, batch 22550, loss[loss=0.2341, simple_loss=0.312, pruned_loss=0.07807, over 21890.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2871, pruned_loss=0.0704, over 4291461.23 frames. ], batch size: 107, lr: 3.67e-03, grad_scale: 8.0 2023-06-25 20:49:48,953 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.99 vs. limit=5.0 2023-06-25 20:50:52,263 INFO [train.py:996] (2/4) Epoch 8, batch 22600, loss[loss=0.2296, simple_loss=0.3095, pruned_loss=0.07478, over 21674.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2906, pruned_loss=0.07019, over 4287848.28 frames. ], batch size: 389, lr: 3.67e-03, grad_scale: 8.0 2023-06-25 20:51:03,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1416372.0, ans=0.0 2023-06-25 20:51:45,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1416492.0, ans=0.2 2023-06-25 20:52:04,649 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.075e+02 4.518e+02 6.028e+02 9.364e+02 1.882e+03, threshold=1.206e+03, percent-clipped=21.0 2023-06-25 20:52:07,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1416552.0, ans=0.125 2023-06-25 20:52:37,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1416672.0, ans=0.025 2023-06-25 20:52:38,689 INFO [train.py:996] (2/4) Epoch 8, batch 22650, loss[loss=0.3111, simple_loss=0.3884, pruned_loss=0.1169, over 21417.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2888, pruned_loss=0.07035, over 4282867.31 frames. ], batch size: 507, lr: 3.67e-03, grad_scale: 8.0 2023-06-25 20:52:57,208 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.45 vs. limit=15.0 2023-06-25 20:53:17,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1416732.0, ans=0.125 2023-06-25 20:53:31,551 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.10 vs. 
limit=15.0 2023-06-25 20:54:16,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1416912.0, ans=0.125 2023-06-25 20:54:16,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1416912.0, ans=0.125 2023-06-25 20:54:20,839 INFO [train.py:996] (2/4) Epoch 8, batch 22700, loss[loss=0.2006, simple_loss=0.268, pruned_loss=0.06665, over 20021.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2841, pruned_loss=0.0703, over 4276944.14 frames. ], batch size: 703, lr: 3.67e-03, grad_scale: 8.0 2023-06-25 20:55:33,796 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.950e+02 3.999e+02 5.550e+02 8.694e+02 1.659e+03, threshold=1.110e+03, percent-clipped=6.0 2023-06-25 20:56:08,910 INFO [train.py:996] (2/4) Epoch 8, batch 22750, loss[loss=0.244, simple_loss=0.3181, pruned_loss=0.08498, over 21697.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2859, pruned_loss=0.07258, over 4279086.91 frames. ], batch size: 351, lr: 3.67e-03, grad_scale: 8.0 2023-06-25 20:56:21,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1417272.0, ans=0.125 2023-06-25 20:56:28,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1417272.0, ans=0.0 2023-06-25 20:56:30,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1417332.0, ans=0.125 2023-06-25 20:57:17,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1417392.0, ans=0.125 2023-06-25 20:57:27,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1417452.0, ans=0.2 2023-06-25 20:57:34,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1417452.0, ans=0.0 2023-06-25 20:57:55,435 INFO [train.py:996] (2/4) Epoch 8, batch 22800, loss[loss=0.1999, simple_loss=0.275, pruned_loss=0.06238, over 21865.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2893, pruned_loss=0.07421, over 4275033.55 frames. ], batch size: 333, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:58:04,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1417572.0, ans=0.0 2023-06-25 20:59:06,884 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.216e+02 4.609e+02 5.638e+02 8.633e+02 1.980e+03, threshold=1.128e+03, percent-clipped=10.0 2023-06-25 20:59:16,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1417752.0, ans=0.1 2023-06-25 20:59:27,146 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.48 vs. limit=15.0 2023-06-25 20:59:41,033 INFO [train.py:996] (2/4) Epoch 8, batch 22850, loss[loss=0.2111, simple_loss=0.2643, pruned_loss=0.07895, over 21485.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2853, pruned_loss=0.07319, over 4271012.09 frames. 
], batch size: 441, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:59:57,495 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=15.0 2023-06-25 21:00:44,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1417992.0, ans=0.0 2023-06-25 21:00:55,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1418052.0, ans=0.0 2023-06-25 21:00:57,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1418052.0, ans=0.125 2023-06-25 21:01:06,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1418052.0, ans=0.125 2023-06-25 21:01:15,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1418112.0, ans=0.025 2023-06-25 21:01:30,633 INFO [train.py:996] (2/4) Epoch 8, batch 22900, loss[loss=0.2482, simple_loss=0.359, pruned_loss=0.06873, over 21617.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2873, pruned_loss=0.0722, over 4280085.26 frames. ], batch size: 441, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:02:28,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1418292.0, ans=0.2 2023-06-25 21:02:45,754 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.594e+02 4.620e+02 6.877e+02 1.071e+03 2.318e+03, threshold=1.375e+03, percent-clipped=23.0 2023-06-25 21:02:52,211 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. limit=6.0 2023-06-25 21:02:53,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1418352.0, ans=0.125 2023-06-25 21:03:08,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1418412.0, ans=0.5 2023-06-25 21:03:25,354 INFO [train.py:996] (2/4) Epoch 8, batch 22950, loss[loss=0.2561, simple_loss=0.386, pruned_loss=0.06308, over 21617.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.3002, pruned_loss=0.07158, over 4274935.00 frames. ], batch size: 389, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:04:12,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1418532.0, ans=0.125 2023-06-25 21:04:39,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1418652.0, ans=0.1 2023-06-25 21:05:10,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1418712.0, ans=0.125 2023-06-25 21:05:12,931 INFO [train.py:996] (2/4) Epoch 8, batch 23000, loss[loss=0.2235, simple_loss=0.2926, pruned_loss=0.07721, over 21916.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.299, pruned_loss=0.06944, over 4279337.22 frames. 
], batch size: 333, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:05:36,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1418832.0, ans=0.2 2023-06-25 21:05:46,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1418832.0, ans=0.1 2023-06-25 21:06:10,512 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.651e+02 4.060e+02 5.403e+02 8.584e+02 1.736e+03, threshold=1.081e+03, percent-clipped=10.0 2023-06-25 21:06:44,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1419012.0, ans=0.125 2023-06-25 21:06:55,858 INFO [train.py:996] (2/4) Epoch 8, batch 23050, loss[loss=0.2105, simple_loss=0.292, pruned_loss=0.06447, over 21693.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2995, pruned_loss=0.07054, over 4281474.45 frames. ], batch size: 298, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:07:13,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1419072.0, ans=0.125 2023-06-25 21:07:34,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1419132.0, ans=0.125 2023-06-25 21:07:41,381 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 21:07:50,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1419192.0, ans=0.1 2023-06-25 21:08:29,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1419312.0, ans=0.015 2023-06-25 21:08:42,762 INFO [train.py:996] (2/4) Epoch 8, batch 23100, loss[loss=0.181, simple_loss=0.2448, pruned_loss=0.05861, over 21524.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2941, pruned_loss=0.07054, over 4281623.81 frames. ], batch size: 212, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:08:51,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1419372.0, ans=0.125 2023-06-25 21:09:44,363 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.818e+02 4.180e+02 5.701e+02 8.990e+02 1.720e+03, threshold=1.140e+03, percent-clipped=10.0 2023-06-25 21:10:16,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1419612.0, ans=0.125 2023-06-25 21:10:30,274 INFO [train.py:996] (2/4) Epoch 8, batch 23150, loss[loss=0.2171, simple_loss=0.2844, pruned_loss=0.07493, over 21928.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2883, pruned_loss=0.06984, over 4288845.74 frames. ], batch size: 316, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:12:15,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1419912.0, ans=0.125 2023-06-25 21:12:17,933 INFO [train.py:996] (2/4) Epoch 8, batch 23200, loss[loss=0.2237, simple_loss=0.3009, pruned_loss=0.0732, over 21790.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2875, pruned_loss=0.07085, over 4294303.15 frames. 
], batch size: 112, lr: 3.67e-03, grad_scale: 32.0 2023-06-25 21:13:19,452 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.122e+02 4.151e+02 5.652e+02 8.200e+02 1.593e+03, threshold=1.130e+03, percent-clipped=6.0 2023-06-25 21:13:48,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1420212.0, ans=0.125 2023-06-25 21:13:59,484 INFO [train.py:996] (2/4) Epoch 8, batch 23250, loss[loss=0.2095, simple_loss=0.2831, pruned_loss=0.06796, over 21534.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2884, pruned_loss=0.0722, over 4296392.08 frames. ], batch size: 131, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:15:34,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1420512.0, ans=0.2 2023-06-25 21:15:52,822 INFO [train.py:996] (2/4) Epoch 8, batch 23300, loss[loss=0.2303, simple_loss=0.3379, pruned_loss=0.06137, over 21786.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2962, pruned_loss=0.07391, over 4299545.58 frames. ], batch size: 247, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:16:12,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1420632.0, ans=0.0 2023-06-25 21:16:57,819 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.217e+02 4.429e+02 5.607e+02 7.442e+02 1.718e+03, threshold=1.121e+03, percent-clipped=5.0 2023-06-25 21:17:01,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1420752.0, ans=0.04949747468305833 2023-06-25 21:17:25,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1420812.0, ans=0.07 2023-06-25 21:17:41,370 INFO [train.py:996] (2/4) Epoch 8, batch 23350, loss[loss=0.1685, simple_loss=0.2531, pruned_loss=0.04193, over 21838.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3014, pruned_loss=0.07303, over 4297407.46 frames. ], batch size: 317, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:17:42,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1420872.0, ans=0.09899494936611666 2023-06-25 21:18:04,087 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.99 vs. limit=15.0 2023-06-25 21:18:17,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1420992.0, ans=0.1 2023-06-25 21:18:43,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1421052.0, ans=0.125 2023-06-25 21:18:54,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1421052.0, ans=0.0 2023-06-25 21:19:24,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1421112.0, ans=0.0 2023-06-25 21:19:29,508 INFO [train.py:996] (2/4) Epoch 8, batch 23400, loss[loss=0.1772, simple_loss=0.2666, pruned_loss=0.04393, over 20890.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2954, pruned_loss=0.06952, over 4289888.00 frames. 
], batch size: 607, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:20:10,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1421292.0, ans=0.125 2023-06-25 21:20:23,773 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.20 vs. limit=15.0 2023-06-25 21:20:34,180 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.708e+02 4.466e+02 6.262e+02 8.598e+02 1.529e+03, threshold=1.252e+03, percent-clipped=12.0 2023-06-25 21:20:57,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1421352.0, ans=0.0 2023-06-25 21:21:01,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1421412.0, ans=10.0 2023-06-25 21:21:17,395 INFO [train.py:996] (2/4) Epoch 8, batch 23450, loss[loss=0.2623, simple_loss=0.3424, pruned_loss=0.09109, over 21516.00 frames. ], tot_loss[loss=0.219, simple_loss=0.295, pruned_loss=0.07151, over 4292220.77 frames. ], batch size: 131, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:21:18,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1421472.0, ans=0.2 2023-06-25 21:21:26,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1421472.0, ans=0.125 2023-06-25 21:23:04,832 INFO [train.py:996] (2/4) Epoch 8, batch 23500, loss[loss=0.232, simple_loss=0.3578, pruned_loss=0.05305, over 19764.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2962, pruned_loss=0.0735, over 4291374.86 frames. ], batch size: 702, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:24:05,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1421952.0, ans=0.125 2023-06-25 21:24:07,692 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.437e+02 4.437e+02 5.920e+02 8.678e+02 1.556e+03, threshold=1.184e+03, percent-clipped=4.0 2023-06-25 21:24:20,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1421952.0, ans=0.125 2023-06-25 21:24:31,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1421952.0, ans=0.125 2023-06-25 21:24:50,830 INFO [train.py:996] (2/4) Epoch 8, batch 23550, loss[loss=0.2166, simple_loss=0.2626, pruned_loss=0.08528, over 21363.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2914, pruned_loss=0.07277, over 4294212.00 frames. ], batch size: 508, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:25:05,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1422072.0, ans=0.0 2023-06-25 21:25:38,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1422192.0, ans=0.1 2023-06-25 21:25:43,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1422192.0, ans=0.125 2023-06-25 21:26:34,220 INFO [train.py:996] (2/4) Epoch 8, batch 23600, loss[loss=0.2555, simple_loss=0.3265, pruned_loss=0.09222, over 21279.00 frames. 
], tot_loss[loss=0.2186, simple_loss=0.2911, pruned_loss=0.07312, over 4277630.26 frames. ], batch size: 159, lr: 3.66e-03, grad_scale: 32.0 2023-06-25 21:27:25,473 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.55 vs. limit=15.0 2023-06-25 21:27:45,447 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.060e+02 4.385e+02 5.770e+02 8.074e+02 1.431e+03, threshold=1.154e+03, percent-clipped=6.0 2023-06-25 21:28:00,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1422612.0, ans=0.125 2023-06-25 21:28:19,043 INFO [train.py:996] (2/4) Epoch 8, batch 23650, loss[loss=0.2645, simple_loss=0.3369, pruned_loss=0.09602, over 21718.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2909, pruned_loss=0.07121, over 4276856.77 frames. ], batch size: 441, lr: 3.66e-03, grad_scale: 32.0 2023-06-25 21:29:18,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1422792.0, ans=0.125 2023-06-25 21:30:15,693 INFO [train.py:996] (2/4) Epoch 8, batch 23700, loss[loss=0.2104, simple_loss=0.2962, pruned_loss=0.06229, over 21600.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2926, pruned_loss=0.07024, over 4277866.69 frames. ], batch size: 414, lr: 3.66e-03, grad_scale: 32.0 2023-06-25 21:30:21,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1422972.0, ans=0.125 2023-06-25 21:30:57,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1423032.0, ans=0.0 2023-06-25 21:31:21,478 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.07 vs. limit=10.0 2023-06-25 21:31:21,923 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.069e+02 4.706e+02 7.567e+02 1.059e+03 2.312e+03, threshold=1.513e+03, percent-clipped=21.0 2023-06-25 21:32:03,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1423212.0, ans=0.125 2023-06-25 21:32:05,912 INFO [train.py:996] (2/4) Epoch 8, batch 23750, loss[loss=0.2262, simple_loss=0.3034, pruned_loss=0.07448, over 21342.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2956, pruned_loss=0.07059, over 4280793.57 frames. ], batch size: 176, lr: 3.66e-03, grad_scale: 32.0 2023-06-25 21:32:06,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1423272.0, ans=0.1 2023-06-25 21:32:55,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1423392.0, ans=0.125 2023-06-25 21:33:54,146 INFO [train.py:996] (2/4) Epoch 8, batch 23800, loss[loss=0.2068, simple_loss=0.2963, pruned_loss=0.05864, over 21613.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2973, pruned_loss=0.07006, over 4273446.53 frames. ], batch size: 263, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:34:48,806 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.74 vs. 
limit=15.0 2023-06-25 21:35:01,845 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.68 vs. limit=15.0 2023-06-25 21:35:03,396 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 21:35:08,035 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.992e+02 4.494e+02 6.635e+02 8.945e+02 1.790e+03, threshold=1.327e+03, percent-clipped=2.0 2023-06-25 21:35:12,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1423752.0, ans=0.2 2023-06-25 21:35:23,711 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=12.0 2023-06-25 21:35:28,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1423812.0, ans=0.1 2023-06-25 21:35:36,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1423812.0, ans=0.0 2023-06-25 21:35:44,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1423872.0, ans=0.0 2023-06-25 21:35:50,941 INFO [train.py:996] (2/4) Epoch 8, batch 23850, loss[loss=0.2362, simple_loss=0.3119, pruned_loss=0.0802, over 21569.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3073, pruned_loss=0.07257, over 4275926.09 frames. ], batch size: 389, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:36:18,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1423932.0, ans=0.1 2023-06-25 21:36:22,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1423932.0, ans=0.0 2023-06-25 21:36:22,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1423932.0, ans=0.2 2023-06-25 21:37:35,490 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.98 vs. limit=12.0 2023-06-25 21:37:39,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1424172.0, ans=0.125 2023-06-25 21:37:40,723 INFO [train.py:996] (2/4) Epoch 8, batch 23900, loss[loss=0.236, simple_loss=0.3189, pruned_loss=0.07657, over 20658.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3137, pruned_loss=0.07462, over 4271474.49 frames. ], batch size: 607, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:37:46,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1424172.0, ans=0.125 2023-06-25 21:38:21,971 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.87 vs. 
limit=15.0 2023-06-25 21:38:41,572 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.117e+02 4.954e+02 6.480e+02 8.834e+02 1.664e+03, threshold=1.296e+03, percent-clipped=3.0 2023-06-25 21:39:09,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1424412.0, ans=0.0 2023-06-25 21:39:13,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1424412.0, ans=15.0 2023-06-25 21:39:22,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1424472.0, ans=0.025 2023-06-25 21:39:23,111 INFO [train.py:996] (2/4) Epoch 8, batch 23950, loss[loss=0.231, simple_loss=0.3027, pruned_loss=0.07965, over 21669.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3079, pruned_loss=0.0744, over 4261344.96 frames. ], batch size: 332, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:39:43,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1424532.0, ans=0.125 2023-06-25 21:39:51,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1424532.0, ans=0.125 2023-06-25 21:40:52,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1424712.0, ans=0.0 2023-06-25 21:41:11,183 INFO [train.py:996] (2/4) Epoch 8, batch 24000, loss[loss=0.2272, simple_loss=0.3235, pruned_loss=0.06541, over 19900.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3089, pruned_loss=0.07722, over 4256060.43 frames. ], batch size: 703, lr: 3.66e-03, grad_scale: 32.0 2023-06-25 21:41:11,184 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 21:41:29,309 INFO [train.py:1028] (2/4) Epoch 8, validation: loss=0.2655, simple_loss=0.3581, pruned_loss=0.0864, over 1796401.00 frames. 2023-06-25 21:41:29,310 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-25 21:41:33,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1424772.0, ans=0.125 2023-06-25 21:42:15,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1424892.0, ans=0.125 2023-06-25 21:42:17,240 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=22.5 2023-06-25 21:42:49,030 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.318e+02 4.591e+02 6.093e+02 8.134e+02 1.870e+03, threshold=1.219e+03, percent-clipped=5.0 2023-06-25 21:43:18,417 INFO [train.py:996] (2/4) Epoch 8, batch 24050, loss[loss=0.2026, simple_loss=0.2933, pruned_loss=0.05592, over 21771.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3101, pruned_loss=0.07757, over 4264260.84 frames. 
], batch size: 332, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:43:36,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1425072.0, ans=0.2 2023-06-25 21:43:36,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1425072.0, ans=0.04949747468305833 2023-06-25 21:44:04,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1425192.0, ans=0.125 2023-06-25 21:45:14,049 INFO [train.py:996] (2/4) Epoch 8, batch 24100, loss[loss=0.2051, simple_loss=0.2968, pruned_loss=0.05673, over 20790.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3096, pruned_loss=0.0761, over 4266305.73 frames. ], batch size: 607, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:45:42,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1425432.0, ans=0.125 2023-06-25 21:46:27,132 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.209e+02 4.362e+02 5.817e+02 7.695e+02 1.790e+03, threshold=1.163e+03, percent-clipped=6.0 2023-06-25 21:46:27,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1425552.0, ans=10.0 2023-06-25 21:46:44,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1425612.0, ans=0.125 2023-06-25 21:46:56,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1425612.0, ans=0.125 2023-06-25 21:47:02,457 INFO [train.py:996] (2/4) Epoch 8, batch 24150, loss[loss=0.2533, simple_loss=0.3188, pruned_loss=0.09387, over 21726.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3088, pruned_loss=0.07746, over 4273239.83 frames. ], batch size: 389, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:47:30,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1425732.0, ans=0.2 2023-06-25 21:47:35,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1425732.0, ans=0.0 2023-06-25 21:48:58,699 INFO [train.py:996] (2/4) Epoch 8, batch 24200, loss[loss=0.2127, simple_loss=0.2939, pruned_loss=0.06576, over 21682.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3101, pruned_loss=0.07821, over 4272794.08 frames. ], batch size: 247, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:49:08,875 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.70 vs. limit=22.5 2023-06-25 21:49:18,029 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.39 vs. 
limit=12.0 2023-06-25 21:49:53,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1426092.0, ans=0.2 2023-06-25 21:49:53,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1426092.0, ans=0.125 2023-06-25 21:50:08,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1426152.0, ans=0.125 2023-06-25 21:50:13,416 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.946e+02 4.269e+02 5.400e+02 8.843e+02 1.561e+03, threshold=1.080e+03, percent-clipped=7.0 2023-06-25 21:50:17,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1426152.0, ans=0.5 2023-06-25 21:50:46,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1426212.0, ans=0.125 2023-06-25 21:50:49,374 INFO [train.py:996] (2/4) Epoch 8, batch 24250, loss[loss=0.1946, simple_loss=0.3174, pruned_loss=0.03592, over 20786.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3088, pruned_loss=0.07382, over 4269948.60 frames. ], batch size: 607, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:50:51,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1426272.0, ans=0.1 2023-06-25 21:51:52,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1426392.0, ans=0.1 2023-06-25 21:52:32,641 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.00 vs. limit=6.0 2023-06-25 21:52:36,523 INFO [train.py:996] (2/4) Epoch 8, batch 24300, loss[loss=0.2225, simple_loss=0.3322, pruned_loss=0.05634, over 20832.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3037, pruned_loss=0.06886, over 4265410.63 frames. ], batch size: 607, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:53:18,684 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0 2023-06-25 21:53:44,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1426752.0, ans=0.125 2023-06-25 21:53:45,037 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.36 vs. limit=15.0 2023-06-25 21:53:48,811 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.598e+02 3.813e+02 5.438e+02 8.323e+02 1.746e+03, threshold=1.088e+03, percent-clipped=13.0 2023-06-25 21:54:09,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1426812.0, ans=0.0 2023-06-25 21:54:22,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1426872.0, ans=0.2 2023-06-25 21:54:29,438 INFO [train.py:996] (2/4) Epoch 8, batch 24350, loss[loss=0.2362, simple_loss=0.3142, pruned_loss=0.07915, over 21828.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2979, pruned_loss=0.06733, over 4266406.61 frames. 
], batch size: 298, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:54:37,622 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.14 vs. limit=15.0 2023-06-25 21:54:40,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1426872.0, ans=0.125 2023-06-25 21:54:49,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1426932.0, ans=0.0 2023-06-25 21:55:14,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1426992.0, ans=0.125 2023-06-25 21:55:40,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1427052.0, ans=0.025 2023-06-25 21:55:41,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1427052.0, ans=0.125 2023-06-25 21:55:47,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1427052.0, ans=0.1 2023-06-25 21:56:18,889 INFO [train.py:996] (2/4) Epoch 8, batch 24400, loss[loss=0.2058, simple_loss=0.2878, pruned_loss=0.06189, over 21588.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3015, pruned_loss=0.0702, over 4269798.87 frames. ], batch size: 263, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:57:29,772 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.945e+02 4.612e+02 5.722e+02 8.222e+02 2.006e+03, threshold=1.144e+03, percent-clipped=13.0 2023-06-25 21:57:49,297 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 21:58:07,747 INFO [train.py:996] (2/4) Epoch 8, batch 24450, loss[loss=0.3152, simple_loss=0.3937, pruned_loss=0.1183, over 21461.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3043, pruned_loss=0.07206, over 4273564.20 frames. ], batch size: 507, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:58:36,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1427532.0, ans=10.0 2023-06-25 21:59:06,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1427592.0, ans=0.0 2023-06-25 21:59:26,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1427712.0, ans=0.2 2023-06-25 21:59:47,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1427712.0, ans=0.0 2023-06-25 21:59:49,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1427712.0, ans=0.125 2023-06-25 21:59:54,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1427772.0, ans=0.2 2023-06-25 21:59:55,393 INFO [train.py:996] (2/4) Epoch 8, batch 24500, loss[loss=0.2031, simple_loss=0.2817, pruned_loss=0.06223, over 21850.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3044, pruned_loss=0.07206, over 4276215.81 frames. 
], batch size: 282, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 22:00:22,529 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=22.5 2023-06-25 22:00:27,472 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=15.0 2023-06-25 22:00:45,790 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:01:04,621 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.043e+02 4.093e+02 5.380e+02 7.688e+02 2.312e+03, threshold=1.076e+03, percent-clipped=10.0 2023-06-25 22:01:47,727 INFO [train.py:996] (2/4) Epoch 8, batch 24550, loss[loss=0.2189, simple_loss=0.2949, pruned_loss=0.07147, over 21835.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3056, pruned_loss=0.07411, over 4273053.37 frames. ], batch size: 247, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 22:02:15,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1428132.0, ans=0.5 2023-06-25 22:03:33,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1428372.0, ans=0.07 2023-06-25 22:03:34,861 INFO [train.py:996] (2/4) Epoch 8, batch 24600, loss[loss=0.1853, simple_loss=0.248, pruned_loss=0.06129, over 21969.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3016, pruned_loss=0.0747, over 4269274.53 frames. ], batch size: 103, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 22:03:52,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1428432.0, ans=0.125 2023-06-25 22:04:14,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1428492.0, ans=0.2 2023-06-25 22:04:20,531 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.18 vs. limit=15.0 2023-06-25 22:04:35,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1428552.0, ans=0.125 2023-06-25 22:04:42,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1428552.0, ans=0.1 2023-06-25 22:04:43,329 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.133e+02 4.316e+02 5.425e+02 7.027e+02 1.651e+03, threshold=1.085e+03, percent-clipped=8.0 2023-06-25 22:05:21,821 INFO [train.py:996] (2/4) Epoch 8, batch 24650, loss[loss=0.1827, simple_loss=0.2537, pruned_loss=0.0558, over 21299.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2935, pruned_loss=0.07311, over 4270903.15 frames. ], batch size: 131, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:05:23,226 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.89 vs. 
limit=15.0 2023-06-25 22:05:43,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1428732.0, ans=0.0 2023-06-25 22:06:19,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1428792.0, ans=0.125 2023-06-25 22:06:38,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1428852.0, ans=0.125 2023-06-25 22:07:07,953 INFO [train.py:996] (2/4) Epoch 8, batch 24700, loss[loss=0.1947, simple_loss=0.2654, pruned_loss=0.06195, over 21551.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2913, pruned_loss=0.0715, over 4254036.70 frames. ], batch size: 414, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:07:22,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1428972.0, ans=0.1 2023-06-25 22:07:34,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1429032.0, ans=0.0 2023-06-25 22:08:03,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1429092.0, ans=0.125 2023-06-25 22:08:05,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1429092.0, ans=0.125 2023-06-25 22:08:17,105 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.664e+02 4.405e+02 6.289e+02 8.929e+02 2.025e+03, threshold=1.258e+03, percent-clipped=12.0 2023-06-25 22:08:49,440 INFO [train.py:996] (2/4) Epoch 8, batch 24750, loss[loss=0.2255, simple_loss=0.3501, pruned_loss=0.05045, over 19777.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2866, pruned_loss=0.06957, over 4248386.19 frames. ], batch size: 702, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:09:27,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1429332.0, ans=0.125 2023-06-25 22:09:37,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1429392.0, ans=0.0 2023-06-25 22:10:37,858 INFO [train.py:996] (2/4) Epoch 8, batch 24800, loss[loss=0.1612, simple_loss=0.2115, pruned_loss=0.0555, over 20829.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2811, pruned_loss=0.069, over 4255885.51 frames. ], batch size: 609, lr: 3.65e-03, grad_scale: 32.0 2023-06-25 22:10:55,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1429632.0, ans=0.125 2023-06-25 22:11:26,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1429692.0, ans=0.125 2023-06-25 22:11:49,239 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.829e+02 4.217e+02 5.954e+02 8.314e+02 1.595e+03, threshold=1.191e+03, percent-clipped=9.0 2023-06-25 22:12:11,164 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.58 vs. limit=22.5 2023-06-25 22:12:20,377 INFO [train.py:996] (2/4) Epoch 8, batch 24850, loss[loss=0.1844, simple_loss=0.247, pruned_loss=0.06091, over 21290.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2818, pruned_loss=0.07018, over 4265299.81 frames. 
], batch size: 159, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:12:45,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=1429932.0, ans=0.2 2023-06-25 22:12:49,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1429932.0, ans=0.0 2023-06-25 22:14:09,939 INFO [train.py:996] (2/4) Epoch 8, batch 24900, loss[loss=0.258, simple_loss=0.3284, pruned_loss=0.09382, over 21594.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2841, pruned_loss=0.0704, over 4272058.20 frames. ], batch size: 389, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:14:19,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1430172.0, ans=0.1 2023-06-25 22:14:21,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1430172.0, ans=0.125 2023-06-25 22:14:34,674 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:14:38,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1430232.0, ans=0.0 2023-06-25 22:15:31,591 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.069e+02 4.057e+02 5.546e+02 7.694e+02 2.051e+03, threshold=1.109e+03, percent-clipped=6.0 2023-06-25 22:15:43,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1430412.0, ans=0.125 2023-06-25 22:15:58,286 INFO [train.py:996] (2/4) Epoch 8, batch 24950, loss[loss=0.2293, simple_loss=0.3041, pruned_loss=0.07731, over 20632.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2918, pruned_loss=0.07432, over 4265799.93 frames. ], batch size: 607, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:16:59,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1430592.0, ans=0.0 2023-06-25 22:17:13,925 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=12.0 2023-06-25 22:17:40,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1430712.0, ans=0.0 2023-06-25 22:17:46,860 INFO [train.py:996] (2/4) Epoch 8, batch 25000, loss[loss=0.2458, simple_loss=0.2937, pruned_loss=0.0989, over 21352.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2988, pruned_loss=0.07666, over 4270962.84 frames. ], batch size: 507, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:17:47,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1430772.0, ans=0.125 2023-06-25 22:18:00,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1430772.0, ans=0.025 2023-06-25 22:18:17,661 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.57 vs. 
limit=22.5 2023-06-25 22:18:54,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1430892.0, ans=0.125 2023-06-25 22:19:07,214 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=12.0 2023-06-25 22:19:07,612 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.298e+02 4.363e+02 6.743e+02 9.687e+02 1.962e+03, threshold=1.349e+03, percent-clipped=15.0 2023-06-25 22:19:18,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1431012.0, ans=0.125 2023-06-25 22:19:32,783 INFO [train.py:996] (2/4) Epoch 8, batch 25050, loss[loss=0.1895, simple_loss=0.2572, pruned_loss=0.06092, over 21780.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2928, pruned_loss=0.07521, over 4276281.76 frames. ], batch size: 317, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:19:59,658 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=22.5 2023-06-25 22:21:08,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1431312.0, ans=0.125 2023-06-25 22:21:19,836 INFO [train.py:996] (2/4) Epoch 8, batch 25100, loss[loss=0.1942, simple_loss=0.2846, pruned_loss=0.05187, over 21544.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2867, pruned_loss=0.07381, over 4271039.05 frames. ], batch size: 230, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:21:20,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1431372.0, ans=0.125 2023-06-25 22:21:45,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1431432.0, ans=0.015 2023-06-25 22:21:57,270 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:22:41,591 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.039e+02 4.362e+02 5.445e+02 8.840e+02 1.769e+03, threshold=1.089e+03, percent-clipped=5.0 2023-06-25 22:22:44,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1431552.0, ans=0.125 2023-06-25 22:22:51,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1431612.0, ans=0.0 2023-06-25 22:23:07,187 INFO [train.py:996] (2/4) Epoch 8, batch 25150, loss[loss=0.1992, simple_loss=0.2879, pruned_loss=0.05528, over 21762.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2908, pruned_loss=0.07162, over 4260984.67 frames. ], batch size: 247, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:23:32,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1431732.0, ans=0.0 2023-06-25 22:24:14,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1431792.0, ans=0.2 2023-06-25 22:24:46,218 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.56 vs. 
limit=15.0 2023-06-25 22:24:46,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1431912.0, ans=0.0 2023-06-25 22:24:55,098 INFO [train.py:996] (2/4) Epoch 8, batch 25200, loss[loss=0.2094, simple_loss=0.3078, pruned_loss=0.05552, over 21608.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2911, pruned_loss=0.06995, over 4264319.86 frames. ], batch size: 263, lr: 3.65e-03, grad_scale: 32.0 2023-06-25 22:25:02,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1431972.0, ans=0.0 2023-06-25 22:25:04,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1431972.0, ans=0.0 2023-06-25 22:25:46,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1432092.0, ans=0.2 2023-06-25 22:26:00,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1432092.0, ans=0.07 2023-06-25 22:26:18,384 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.791e+02 3.750e+02 5.347e+02 7.396e+02 1.859e+03, threshold=1.069e+03, percent-clipped=8.0 2023-06-25 22:26:18,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1432152.0, ans=0.035 2023-06-25 22:26:41,764 INFO [train.py:996] (2/4) Epoch 8, batch 25250, loss[loss=0.2222, simple_loss=0.2814, pruned_loss=0.08151, over 21504.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2883, pruned_loss=0.06815, over 4258271.73 frames. ], batch size: 414, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:27:01,247 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=12.0 2023-06-25 22:27:10,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1432332.0, ans=0.0 2023-06-25 22:27:51,778 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:28:15,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1432512.0, ans=0.0 2023-06-25 22:28:29,124 INFO [train.py:996] (2/4) Epoch 8, batch 25300, loss[loss=0.2444, simple_loss=0.3209, pruned_loss=0.08393, over 21575.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.286, pruned_loss=0.06713, over 4242677.65 frames. ], batch size: 414, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:28:35,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1432572.0, ans=0.0 2023-06-25 22:29:53,832 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.880e+02 4.048e+02 5.397e+02 7.800e+02 1.560e+03, threshold=1.079e+03, percent-clipped=8.0 2023-06-25 22:30:17,503 INFO [train.py:996] (2/4) Epoch 8, batch 25350, loss[loss=0.184, simple_loss=0.2669, pruned_loss=0.05055, over 21716.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2875, pruned_loss=0.06702, over 4253943.83 frames. 
], batch size: 298, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:30:27,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1432872.0, ans=0.0 2023-06-25 22:30:28,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1432872.0, ans=0.1 2023-06-25 22:30:42,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1432932.0, ans=0.0 2023-06-25 22:30:55,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1432932.0, ans=0.1 2023-06-25 22:31:12,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1432992.0, ans=0.1 2023-06-25 22:31:38,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1433052.0, ans=0.07 2023-06-25 22:31:55,500 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.80 vs. limit=10.0 2023-06-25 22:31:59,556 INFO [train.py:996] (2/4) Epoch 8, batch 25400, loss[loss=0.2205, simple_loss=0.2851, pruned_loss=0.07789, over 21759.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2831, pruned_loss=0.06609, over 4256328.22 frames. ], batch size: 316, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:32:31,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1433232.0, ans=15.0 2023-06-25 22:33:13,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1433352.0, ans=0.0 2023-06-25 22:33:21,728 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.049e+02 4.073e+02 6.227e+02 9.020e+02 1.627e+03, threshold=1.245e+03, percent-clipped=13.0 2023-06-25 22:33:32,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1433412.0, ans=0.1 2023-06-25 22:33:37,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1433412.0, ans=0.0 2023-06-25 22:33:45,993 INFO [train.py:996] (2/4) Epoch 8, batch 25450, loss[loss=0.1886, simple_loss=0.287, pruned_loss=0.04507, over 21801.00 frames. ], tot_loss[loss=0.208, simple_loss=0.283, pruned_loss=0.06649, over 4257647.21 frames. ], batch size: 282, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:34:07,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1433472.0, ans=0.0 2023-06-25 22:34:17,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1433532.0, ans=0.125 2023-06-25 22:34:47,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1433592.0, ans=0.125 2023-06-25 22:35:31,625 INFO [train.py:996] (2/4) Epoch 8, batch 25500, loss[loss=0.2476, simple_loss=0.3317, pruned_loss=0.08177, over 21473.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2838, pruned_loss=0.06413, over 4260434.69 frames. 
], batch size: 471, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:35:34,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1433772.0, ans=0.0 2023-06-25 22:35:37,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1433772.0, ans=0.2 2023-06-25 22:36:24,993 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:36:56,877 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.763e+02 3.870e+02 4.829e+02 7.230e+02 1.638e+03, threshold=9.659e+02, percent-clipped=1.0 2023-06-25 22:37:02,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1434012.0, ans=0.0 2023-06-25 22:37:18,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1434012.0, ans=0.0 2023-06-25 22:37:21,636 INFO [train.py:996] (2/4) Epoch 8, batch 25550, loss[loss=0.1947, simple_loss=0.275, pruned_loss=0.05723, over 21462.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.292, pruned_loss=0.06511, over 4256734.54 frames. ], batch size: 131, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:38:28,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1434192.0, ans=0.125 2023-06-25 22:38:32,708 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=12.0 2023-06-25 22:38:47,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1434252.0, ans=0.0 2023-06-25 22:38:47,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1434252.0, ans=0.125 2023-06-25 22:38:47,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1434252.0, ans=0.09899494936611666 2023-06-25 22:39:13,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1434372.0, ans=0.125 2023-06-25 22:39:20,118 INFO [train.py:996] (2/4) Epoch 8, batch 25600, loss[loss=0.2344, simple_loss=0.3104, pruned_loss=0.07918, over 21885.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2958, pruned_loss=0.06635, over 4253165.91 frames. ], batch size: 371, lr: 3.65e-03, grad_scale: 32.0 2023-06-25 22:40:22,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1434552.0, ans=0.09899494936611666 2023-06-25 22:40:27,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1434552.0, ans=0.2 2023-06-25 22:40:27,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1434552.0, ans=0.125 2023-06-25 22:40:31,979 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.098e+02 4.217e+02 6.682e+02 9.360e+02 1.950e+03, threshold=1.336e+03, percent-clipped=22.0 2023-06-25 22:41:11,570 INFO [train.py:996] (2/4) Epoch 8, batch 25650, loss[loss=0.1889, simple_loss=0.2537, pruned_loss=0.06206, over 21653.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2963, pruned_loss=0.06866, over 4260964.31 frames. 
], batch size: 282, lr: 3.65e-03, grad_scale: 32.0 2023-06-25 22:41:27,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1434672.0, ans=0.07 2023-06-25 22:42:17,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1434852.0, ans=0.0 2023-06-25 22:42:26,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1434912.0, ans=0.125 2023-06-25 22:42:48,766 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:42:57,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1434972.0, ans=0.0 2023-06-25 22:42:58,589 INFO [train.py:996] (2/4) Epoch 8, batch 25700, loss[loss=0.2342, simple_loss=0.2961, pruned_loss=0.0862, over 21712.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2927, pruned_loss=0.06937, over 4266001.51 frames. ], batch size: 441, lr: 3.65e-03, grad_scale: 32.0 2023-06-25 22:44:06,642 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.951e+02 3.970e+02 5.194e+02 7.142e+02 1.504e+03, threshold=1.039e+03, percent-clipped=1.0 2023-06-25 22:44:11,290 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:44:48,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1435212.0, ans=0.07 2023-06-25 22:44:52,957 INFO [train.py:996] (2/4) Epoch 8, batch 25750, loss[loss=0.2548, simple_loss=0.323, pruned_loss=0.09324, over 21810.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2995, pruned_loss=0.07251, over 4271470.02 frames. ], batch size: 118, lr: 3.65e-03, grad_scale: 32.0 2023-06-25 22:45:27,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1435332.0, ans=0.1 2023-06-25 22:45:34,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1435392.0, ans=0.125 2023-06-25 22:45:36,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1435392.0, ans=0.0 2023-06-25 22:45:52,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1435452.0, ans=0.07 2023-06-25 22:46:40,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1435512.0, ans=0.0 2023-06-25 22:46:45,429 INFO [train.py:996] (2/4) Epoch 8, batch 25800, loss[loss=0.2965, simple_loss=0.3682, pruned_loss=0.1124, over 21769.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3112, pruned_loss=0.07621, over 4269951.50 frames. 
], batch size: 441, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:46:46,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1435572.0, ans=0.0 2023-06-25 22:46:49,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1435572.0, ans=0.2 2023-06-25 22:46:50,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1435572.0, ans=0.125 2023-06-25 22:48:11,408 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.235e+02 4.952e+02 6.520e+02 9.122e+02 2.118e+03, threshold=1.304e+03, percent-clipped=17.0 2023-06-25 22:48:17,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1435812.0, ans=0.0 2023-06-25 22:48:17,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1435812.0, ans=0.0 2023-06-25 22:48:33,935 INFO [train.py:996] (2/4) Epoch 8, batch 25850, loss[loss=0.2241, simple_loss=0.2828, pruned_loss=0.08269, over 20291.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.312, pruned_loss=0.07535, over 4270459.38 frames. ], batch size: 707, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:48:37,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1435872.0, ans=15.0 2023-06-25 22:49:27,926 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-25 22:49:31,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1435992.0, ans=0.0 2023-06-25 22:49:39,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1435992.0, ans=0.125 2023-06-25 22:50:01,695 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.18 vs. limit=10.0 2023-06-25 22:50:23,327 INFO [train.py:996] (2/4) Epoch 8, batch 25900, loss[loss=0.2639, simple_loss=0.3543, pruned_loss=0.08674, over 21804.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3134, pruned_loss=0.07668, over 4275013.86 frames. ], batch size: 282, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:50:26,373 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-06-25 22:50:31,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1436172.0, ans=0.0 2023-06-25 22:50:33,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1436172.0, ans=0.125 2023-06-25 22:51:05,041 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.51 vs. limit=6.0 2023-06-25 22:51:43,656 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.598e+02 5.216e+02 8.298e+02 1.003e+03 1.891e+03, threshold=1.660e+03, percent-clipped=7.0 2023-06-25 22:52:06,601 INFO [train.py:996] (2/4) Epoch 8, batch 25950, loss[loss=0.2103, simple_loss=0.298, pruned_loss=0.06134, over 21678.00 frames. 
], tot_loss[loss=0.2376, simple_loss=0.3182, pruned_loss=0.07851, over 4277461.76 frames. ], batch size: 231, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 22:52:07,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1436472.0, ans=0.0 2023-06-25 22:52:58,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1436532.0, ans=0.125 2023-06-25 22:53:53,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1436712.0, ans=0.0 2023-06-25 22:53:58,754 INFO [train.py:996] (2/4) Epoch 8, batch 26000, loss[loss=0.2362, simple_loss=0.3173, pruned_loss=0.0775, over 21375.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3172, pruned_loss=0.0768, over 4275931.11 frames. ], batch size: 549, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 22:54:28,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1436832.0, ans=0.2 2023-06-25 22:54:48,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1436832.0, ans=0.0 2023-06-25 22:55:20,389 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.951e+02 4.119e+02 5.246e+02 6.904e+02 1.299e+03, threshold=1.049e+03, percent-clipped=0.0 2023-06-25 22:55:32,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1437012.0, ans=0.125 2023-06-25 22:55:47,878 INFO [train.py:996] (2/4) Epoch 8, batch 26050, loss[loss=0.2292, simple_loss=0.2994, pruned_loss=0.07944, over 21865.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3154, pruned_loss=0.07811, over 4279758.83 frames. ], batch size: 124, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 22:56:00,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1437072.0, ans=0.1 2023-06-25 22:56:24,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1437132.0, ans=0.125 2023-06-25 22:57:09,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1437312.0, ans=0.0 2023-06-25 22:57:12,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1437312.0, ans=0.125 2023-06-25 22:57:28,611 INFO [train.py:996] (2/4) Epoch 8, batch 26100, loss[loss=0.2087, simple_loss=0.2769, pruned_loss=0.07024, over 21589.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3107, pruned_loss=0.07715, over 4278399.11 frames. ], batch size: 212, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 22:57:43,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1437372.0, ans=0.125 2023-06-25 22:58:03,882 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.28 vs. 
limit=15.0 2023-06-25 22:58:44,092 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.195e+02 4.438e+02 5.651e+02 7.112e+02 1.480e+03, threshold=1.130e+03, percent-clipped=4.0 2023-06-25 22:58:58,955 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.71 vs. limit=10.0 2023-06-25 22:59:00,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1437612.0, ans=0.0 2023-06-25 22:59:18,473 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-25 22:59:22,536 INFO [train.py:996] (2/4) Epoch 8, batch 26150, loss[loss=0.2281, simple_loss=0.3164, pruned_loss=0.0699, over 21437.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3078, pruned_loss=0.07733, over 4286671.62 frames. ], batch size: 131, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:00:07,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1437792.0, ans=0.0 2023-06-25 23:00:32,853 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.92 vs. limit=22.5 2023-06-25 23:01:09,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1437912.0, ans=0.125 2023-06-25 23:01:09,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1437912.0, ans=0.125 2023-06-25 23:01:11,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1437972.0, ans=0.125 2023-06-25 23:01:12,288 INFO [train.py:996] (2/4) Epoch 8, batch 26200, loss[loss=0.2394, simple_loss=0.3414, pruned_loss=0.06871, over 21760.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3095, pruned_loss=0.07615, over 4281360.71 frames. ], batch size: 298, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:02:08,864 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:02:15,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1438152.0, ans=0.0 2023-06-25 23:02:23,569 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.227e+02 4.470e+02 5.888e+02 8.750e+02 1.495e+03, threshold=1.178e+03, percent-clipped=8.0 2023-06-25 23:02:55,432 INFO [train.py:996] (2/4) Epoch 8, batch 26250, loss[loss=0.2182, simple_loss=0.2937, pruned_loss=0.07138, over 21246.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3113, pruned_loss=0.075, over 4275068.23 frames. ], batch size: 143, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:02:56,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1438272.0, ans=0.125 2023-06-25 23:03:13,916 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.36 vs. limit=10.0 2023-06-25 23:03:51,235 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.03 vs. 
limit=10.0 2023-06-25 23:04:03,123 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=15.0 2023-06-25 23:04:36,275 INFO [train.py:996] (2/4) Epoch 8, batch 26300, loss[loss=0.1946, simple_loss=0.2558, pruned_loss=0.06668, over 21240.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.309, pruned_loss=0.07584, over 4284974.99 frames. ], batch size: 608, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:04:56,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1438632.0, ans=0.125 2023-06-25 23:05:33,636 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.21 vs. limit=15.0 2023-06-25 23:06:03,913 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.391e+02 4.218e+02 5.396e+02 7.440e+02 1.508e+03, threshold=1.079e+03, percent-clipped=2.0 2023-06-25 23:06:24,590 INFO [train.py:996] (2/4) Epoch 8, batch 26350, loss[loss=0.2491, simple_loss=0.3252, pruned_loss=0.08656, over 21274.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.307, pruned_loss=0.07603, over 4284501.81 frames. ], batch size: 143, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:08:09,004 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=12.0 2023-06-25 23:08:11,333 INFO [train.py:996] (2/4) Epoch 8, batch 26400, loss[loss=0.2228, simple_loss=0.2724, pruned_loss=0.08659, over 21518.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3018, pruned_loss=0.07607, over 4279557.52 frames. ], batch size: 441, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 23:08:48,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1439232.0, ans=0.1 2023-06-25 23:08:58,651 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-06-25 23:09:11,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1439292.0, ans=0.125 2023-06-25 23:09:11,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1439292.0, ans=0.125 2023-06-25 23:09:36,049 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.120e+02 4.025e+02 5.044e+02 7.451e+02 1.741e+03, threshold=1.009e+03, percent-clipped=9.0 2023-06-25 23:09:57,665 INFO [train.py:996] (2/4) Epoch 8, batch 26450, loss[loss=0.2122, simple_loss=0.2848, pruned_loss=0.06978, over 21220.00 frames. ], tot_loss[loss=0.228, simple_loss=0.304, pruned_loss=0.07605, over 4278478.06 frames. ], batch size: 159, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 23:10:42,316 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=15.0 2023-06-25 23:11:21,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1439652.0, ans=0.125 2023-06-25 23:11:48,720 INFO [train.py:996] (2/4) Epoch 8, batch 26500, loss[loss=0.1787, simple_loss=0.236, pruned_loss=0.06075, over 21751.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.305, pruned_loss=0.07481, over 4270481.72 frames. 
], batch size: 124, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:12:26,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1439832.0, ans=0.2 2023-06-25 23:13:02,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1439952.0, ans=0.0 2023-06-25 23:13:22,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1439952.0, ans=0.1 2023-06-25 23:13:23,165 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.879e+02 4.514e+02 6.896e+02 1.400e+03 2.768e+03, threshold=1.379e+03, percent-clipped=34.0 2023-06-25 23:13:53,895 INFO [train.py:996] (2/4) Epoch 8, batch 26550, loss[loss=0.2023, simple_loss=0.2949, pruned_loss=0.05485, over 21685.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3043, pruned_loss=0.07306, over 4269896.80 frames. ], batch size: 332, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:14:54,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1440192.0, ans=0.125 2023-06-25 23:15:01,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1440252.0, ans=0.125 2023-06-25 23:15:32,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1440312.0, ans=0.2 2023-06-25 23:15:47,313 INFO [train.py:996] (2/4) Epoch 8, batch 26600, loss[loss=0.2451, simple_loss=0.3204, pruned_loss=0.08491, over 21556.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3029, pruned_loss=0.06991, over 4272510.27 frames. ], batch size: 441, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:16:05,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1440432.0, ans=0.0 2023-06-25 23:16:53,318 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=15.0 2023-06-25 23:16:54,880 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-25 23:17:00,247 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.847e+02 4.407e+02 5.733e+02 8.512e+02 1.391e+03, threshold=1.147e+03, percent-clipped=1.0 2023-06-25 23:17:15,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1440612.0, ans=0.125 2023-06-25 23:17:31,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1440612.0, ans=0.125 2023-06-25 23:17:35,792 INFO [train.py:996] (2/4) Epoch 8, batch 26650, loss[loss=0.1633, simple_loss=0.253, pruned_loss=0.03687, over 21667.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2963, pruned_loss=0.06873, over 4255021.66 frames. 
], batch size: 391, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:18:03,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1440732.0, ans=0.0 2023-06-25 23:18:43,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1440852.0, ans=0.2 2023-06-25 23:19:18,194 INFO [train.py:996] (2/4) Epoch 8, batch 26700, loss[loss=0.2124, simple_loss=0.2874, pruned_loss=0.06869, over 21931.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2889, pruned_loss=0.06622, over 4254261.94 frames. ], batch size: 333, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:19:45,758 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=22.5 2023-06-25 23:19:46,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1441032.0, ans=0.125 2023-06-25 23:20:34,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1441152.0, ans=0.5 2023-06-25 23:20:37,590 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.642e+02 3.824e+02 5.567e+02 8.569e+02 1.745e+03, threshold=1.113e+03, percent-clipped=13.0 2023-06-25 23:20:47,765 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.54 vs. limit=10.0 2023-06-25 23:21:01,558 INFO [train.py:996] (2/4) Epoch 8, batch 26750, loss[loss=0.2293, simple_loss=0.318, pruned_loss=0.07025, over 21575.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2893, pruned_loss=0.06514, over 4260188.97 frames. ], batch size: 389, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:21:10,470 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=15.0 2023-06-25 23:21:16,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1441272.0, ans=0.125 2023-06-25 23:21:47,750 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:22:29,558 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.11 vs. limit=10.0 2023-06-25 23:22:46,503 INFO [train.py:996] (2/4) Epoch 8, batch 26800, loss[loss=0.2545, simple_loss=0.3242, pruned_loss=0.09247, over 21611.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2943, pruned_loss=0.06737, over 4264643.30 frames. ], batch size: 389, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 23:24:14,276 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.253e+02 4.422e+02 6.215e+02 9.798e+02 1.990e+03, threshold=1.243e+03, percent-clipped=9.0 2023-06-25 23:24:21,004 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.32 vs. limit=15.0 2023-06-25 23:24:26,009 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.31 vs. limit=12.0 2023-06-25 23:24:26,227 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.68 vs. 
limit=15.0 2023-06-25 23:24:38,147 INFO [train.py:996] (2/4) Epoch 8, batch 26850, loss[loss=0.1853, simple_loss=0.2555, pruned_loss=0.05755, over 15516.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2959, pruned_loss=0.06991, over 4261318.44 frames. ], batch size: 60, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 23:24:40,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1441872.0, ans=0.025 2023-06-25 23:26:00,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1442112.0, ans=0.1 2023-06-25 23:26:14,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1442112.0, ans=0.125 2023-06-25 23:26:17,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1442112.0, ans=0.1 2023-06-25 23:26:18,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1442172.0, ans=0.125 2023-06-25 23:26:20,006 INFO [train.py:996] (2/4) Epoch 8, batch 26900, loss[loss=0.1886, simple_loss=0.2493, pruned_loss=0.06392, over 21167.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2873, pruned_loss=0.06905, over 4264581.28 frames. ], batch size: 143, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:26:20,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1442172.0, ans=0.0 2023-06-25 23:26:41,843 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.30 vs. limit=15.0 2023-06-25 23:27:19,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1442292.0, ans=0.0 2023-06-25 23:27:26,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1442352.0, ans=0.0 2023-06-25 23:27:40,240 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.068e+02 3.925e+02 6.896e+02 1.001e+03 2.184e+03, threshold=1.379e+03, percent-clipped=14.0 2023-06-25 23:27:58,331 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:27:59,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1442412.0, ans=0.125 2023-06-25 23:28:02,263 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.76 vs. limit=15.0 2023-06-25 23:28:02,733 INFO [train.py:996] (2/4) Epoch 8, batch 26950, loss[loss=0.2328, simple_loss=0.3241, pruned_loss=0.07075, over 21713.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2861, pruned_loss=0.06925, over 4260958.30 frames. 
], batch size: 298, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:28:25,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1442532.0, ans=10.0 2023-06-25 23:28:32,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1442532.0, ans=0.0 2023-06-25 23:28:36,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1442592.0, ans=0.0 2023-06-25 23:29:17,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1442652.0, ans=0.0 2023-06-25 23:29:21,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1442652.0, ans=0.125 2023-06-25 23:29:52,144 INFO [train.py:996] (2/4) Epoch 8, batch 27000, loss[loss=0.2039, simple_loss=0.3002, pruned_loss=0.05383, over 21593.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2874, pruned_loss=0.0679, over 4267660.03 frames. ], batch size: 389, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:29:52,144 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 23:30:09,505 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.4915, 4.5771, 2.1798, 4.0588], device='cuda:2') 2023-06-25 23:30:10,472 INFO [train.py:1028] (2/4) Epoch 8, validation: loss=0.2506, simple_loss=0.341, pruned_loss=0.08006, over 1796401.00 frames. 2023-06-25 23:30:10,473 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-25 23:30:50,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1442832.0, ans=0.125 2023-06-25 23:30:51,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1442892.0, ans=0.1 2023-06-25 23:31:05,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1442892.0, ans=0.2 2023-06-25 23:31:10,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1442952.0, ans=0.125 2023-06-25 23:31:32,484 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.635e+02 4.043e+02 5.265e+02 7.888e+02 2.132e+03, threshold=1.053e+03, percent-clipped=7.0 2023-06-25 23:31:43,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1443012.0, ans=0.2 2023-06-25 23:31:49,559 INFO [train.py:996] (2/4) Epoch 8, batch 27050, loss[loss=0.2057, simple_loss=0.295, pruned_loss=0.05818, over 21798.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2894, pruned_loss=0.06512, over 4267206.46 frames. 
], batch size: 247, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:32:15,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1443132.0, ans=0.2 2023-06-25 23:32:17,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1443132.0, ans=0.125 2023-06-25 23:32:35,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1443132.0, ans=0.0 2023-06-25 23:33:18,677 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0 2023-06-25 23:33:34,087 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:33:38,799 INFO [train.py:996] (2/4) Epoch 8, batch 27100, loss[loss=0.2169, simple_loss=0.3159, pruned_loss=0.05893, over 21728.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2917, pruned_loss=0.0666, over 4272537.80 frames. ], batch size: 389, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:34:13,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1443432.0, ans=0.05 2023-06-25 23:35:11,013 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.112e+02 4.566e+02 6.448e+02 9.782e+02 2.509e+03, threshold=1.290e+03, percent-clipped=22.0 2023-06-25 23:35:33,787 INFO [train.py:996] (2/4) Epoch 8, batch 27150, loss[loss=0.2606, simple_loss=0.354, pruned_loss=0.08364, over 21716.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3025, pruned_loss=0.07003, over 4279755.47 frames. ], batch size: 298, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:35:50,985 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.20 vs. limit=6.0 2023-06-25 23:36:09,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1443732.0, ans=0.0 2023-06-25 23:36:10,187 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.80 vs. limit=22.5 2023-06-25 23:36:46,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1443852.0, ans=0.0 2023-06-25 23:37:28,331 INFO [train.py:996] (2/4) Epoch 8, batch 27200, loss[loss=0.2376, simple_loss=0.3186, pruned_loss=0.07828, over 21481.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3082, pruned_loss=0.07229, over 4272452.68 frames. ], batch size: 211, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 23:37:48,833 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.78 vs. 
limit=22.5 2023-06-25 23:38:34,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1444152.0, ans=0.0 2023-06-25 23:39:01,345 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.411e+02 4.854e+02 6.757e+02 9.648e+02 1.735e+03, threshold=1.351e+03, percent-clipped=9.0 2023-06-25 23:39:10,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1444212.0, ans=0.0 2023-06-25 23:39:18,850 INFO [train.py:996] (2/4) Epoch 8, batch 27250, loss[loss=0.2617, simple_loss=0.3511, pruned_loss=0.0861, over 20821.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3123, pruned_loss=0.07685, over 4272125.93 frames. ], batch size: 608, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:39:23,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1444272.0, ans=0.125 2023-06-25 23:39:42,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1444332.0, ans=0.0 2023-06-25 23:40:08,913 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0 2023-06-25 23:40:33,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1444452.0, ans=0.2 2023-06-25 23:40:49,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1444452.0, ans=0.0 2023-06-25 23:41:14,467 INFO [train.py:996] (2/4) Epoch 8, batch 27300, loss[loss=0.2376, simple_loss=0.3241, pruned_loss=0.07555, over 21255.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3155, pruned_loss=0.07814, over 4278342.79 frames. ], batch size: 159, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:42:26,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1444752.0, ans=0.0 2023-06-25 23:42:43,064 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.204e+02 4.436e+02 5.757e+02 8.260e+02 1.524e+03, threshold=1.151e+03, percent-clipped=4.0 2023-06-25 23:42:57,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1444812.0, ans=0.125 2023-06-25 23:42:58,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1444812.0, ans=0.125 2023-06-25 23:43:03,219 INFO [train.py:996] (2/4) Epoch 8, batch 27350, loss[loss=0.2379, simple_loss=0.3178, pruned_loss=0.079, over 21646.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3187, pruned_loss=0.07916, over 4280578.20 frames. 
], batch size: 112, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:43:15,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1444872.0, ans=0.125 2023-06-25 23:43:20,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1444932.0, ans=0.125 2023-06-25 23:44:06,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1444992.0, ans=0.125 2023-06-25 23:44:41,317 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2023-06-25 23:44:45,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1445112.0, ans=0.0 2023-06-25 23:44:50,146 INFO [train.py:996] (2/4) Epoch 8, batch 27400, loss[loss=0.2108, simple_loss=0.2735, pruned_loss=0.07402, over 21628.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3136, pruned_loss=0.07835, over 4287124.97 frames. ], batch size: 414, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:45:15,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1445232.0, ans=0.0 2023-06-25 23:45:15,731 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-25 23:46:05,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1445352.0, ans=0.125 2023-06-25 23:46:08,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1445352.0, ans=0.125 2023-06-25 23:46:14,720 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.145e+02 3.925e+02 4.930e+02 6.414e+02 1.207e+03, threshold=9.861e+02, percent-clipped=2.0 2023-06-25 23:46:33,513 INFO [train.py:996] (2/4) Epoch 8, batch 27450, loss[loss=0.2236, simple_loss=0.3178, pruned_loss=0.06473, over 21226.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3071, pruned_loss=0.07656, over 4280130.66 frames. ], batch size: 548, lr: 3.63e-03, grad_scale: 8.0 2023-06-25 23:46:35,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1445472.0, ans=0.04949747468305833 2023-06-25 23:47:58,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1445712.0, ans=0.125 2023-06-25 23:48:04,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1445712.0, ans=0.125 2023-06-25 23:48:18,951 INFO [train.py:996] (2/4) Epoch 8, batch 27500, loss[loss=0.2437, simple_loss=0.3187, pruned_loss=0.08437, over 21445.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3057, pruned_loss=0.07672, over 4285403.98 frames. 
], batch size: 131, lr: 3.63e-03, grad_scale: 8.0 2023-06-25 23:48:21,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1445772.0, ans=0.2 2023-06-25 23:49:05,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1445892.0, ans=0.1 2023-06-25 23:49:42,671 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.981e+02 3.778e+02 4.835e+02 6.283e+02 1.305e+03, threshold=9.670e+02, percent-clipped=1.0 2023-06-25 23:50:00,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1446072.0, ans=0.125 2023-06-25 23:50:01,310 INFO [train.py:996] (2/4) Epoch 8, batch 27550, loss[loss=0.1934, simple_loss=0.2679, pruned_loss=0.05938, over 21495.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2996, pruned_loss=0.07298, over 4291712.71 frames. ], batch size: 389, lr: 3.63e-03, grad_scale: 8.0 2023-06-25 23:50:07,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1446072.0, ans=0.2 2023-06-25 23:51:13,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1446252.0, ans=0.125 2023-06-25 23:51:40,260 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-25 23:51:49,472 INFO [train.py:996] (2/4) Epoch 8, batch 27600, loss[loss=0.1913, simple_loss=0.2622, pruned_loss=0.06019, over 21758.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.295, pruned_loss=0.07236, over 4274743.37 frames. ], batch size: 300, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:53:16,416 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.997e+02 3.759e+02 4.592e+02 6.391e+02 1.970e+03, threshold=9.184e+02, percent-clipped=8.0 2023-06-25 23:53:28,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1446612.0, ans=0.0 2023-06-25 23:53:34,788 INFO [train.py:996] (2/4) Epoch 8, batch 27650, loss[loss=0.2089, simple_loss=0.2779, pruned_loss=0.06995, over 21217.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2887, pruned_loss=0.07155, over 4274561.93 frames. ], batch size: 143, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:54:15,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1446732.0, ans=0.1 2023-06-25 23:55:09,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1446912.0, ans=0.1 2023-06-25 23:55:22,825 INFO [train.py:996] (2/4) Epoch 8, batch 27700, loss[loss=0.2359, simple_loss=0.3256, pruned_loss=0.07312, over 21696.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2892, pruned_loss=0.0698, over 4265496.59 frames. 
], batch size: 298, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:56:21,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1447092.0, ans=0.125 2023-06-25 23:56:45,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1447152.0, ans=0.125 2023-06-25 23:56:45,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1447152.0, ans=0.0 2023-06-25 23:56:56,236 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.091e+02 3.950e+02 5.187e+02 7.067e+02 1.545e+03, threshold=1.037e+03, percent-clipped=11.0 2023-06-25 23:57:05,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1447212.0, ans=0.125 2023-06-25 23:57:06,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1447212.0, ans=0.0 2023-06-25 23:57:09,971 INFO [train.py:996] (2/4) Epoch 8, batch 27750, loss[loss=0.1931, simple_loss=0.284, pruned_loss=0.05112, over 21851.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2927, pruned_loss=0.06923, over 4270020.84 frames. ], batch size: 316, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:57:23,014 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-25 23:57:47,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1447332.0, ans=0.0 2023-06-25 23:58:54,843 INFO [train.py:996] (2/4) Epoch 8, batch 27800, loss[loss=0.1994, simple_loss=0.2715, pruned_loss=0.06366, over 21859.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2907, pruned_loss=0.06925, over 4281637.56 frames. ], batch size: 282, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:59:09,436 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.02 vs. limit=10.0 2023-06-25 23:59:09,700 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-25 23:59:35,501 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.84 vs. limit=10.0 2023-06-25 23:59:43,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1447692.0, ans=0.0 2023-06-26 00:00:01,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1447752.0, ans=0.0 2023-06-26 00:00:23,976 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.743e+02 4.274e+02 5.854e+02 7.453e+02 1.495e+03, threshold=1.171e+03, percent-clipped=16.0 2023-06-26 00:00:42,957 INFO [train.py:996] (2/4) Epoch 8, batch 27850, loss[loss=0.2228, simple_loss=0.2976, pruned_loss=0.07403, over 21730.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2906, pruned_loss=0.07083, over 4289619.98 frames. 
], batch size: 389, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:01:27,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1447932.0, ans=0.07 2023-06-26 00:02:11,198 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:02:23,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1448112.0, ans=0.025 2023-06-26 00:02:39,189 INFO [train.py:996] (2/4) Epoch 8, batch 27900, loss[loss=0.2707, simple_loss=0.3798, pruned_loss=0.08082, over 21194.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3001, pruned_loss=0.07177, over 4289995.02 frames. ], batch size: 548, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:02:39,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1448172.0, ans=0.1 2023-06-26 00:03:01,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1448232.0, ans=0.0 2023-06-26 00:03:01,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1448232.0, ans=0.125 2023-06-26 00:03:18,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1448232.0, ans=0.0 2023-06-26 00:03:35,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1448292.0, ans=0.125 2023-06-26 00:04:15,993 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.747e+02 3.981e+02 4.843e+02 6.105e+02 1.501e+03, threshold=9.685e+02, percent-clipped=1.0 2023-06-26 00:04:35,196 INFO [train.py:996] (2/4) Epoch 8, batch 27950, loss[loss=0.2557, simple_loss=0.3425, pruned_loss=0.08439, over 21460.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2992, pruned_loss=0.06825, over 4291173.53 frames. ], batch size: 471, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:04:39,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1448472.0, ans=0.125 2023-06-26 00:04:52,174 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0 2023-06-26 00:06:00,778 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=15.0 2023-06-26 00:06:22,310 INFO [train.py:996] (2/4) Epoch 8, batch 28000, loss[loss=0.2141, simple_loss=0.2864, pruned_loss=0.07084, over 21552.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2966, pruned_loss=0.06633, over 4284854.38 frames. 
], batch size: 131, lr: 3.63e-03, grad_scale: 32.0 2023-06-26 00:06:38,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1448832.0, ans=0.125 2023-06-26 00:07:13,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1448892.0, ans=0.0 2023-06-26 00:07:23,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1448952.0, ans=0.0 2023-06-26 00:07:58,624 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.068e+02 4.485e+02 6.487e+02 9.458e+02 1.758e+03, threshold=1.297e+03, percent-clipped=21.0 2023-06-26 00:08:02,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1449012.0, ans=0.1 2023-06-26 00:08:10,963 INFO [train.py:996] (2/4) Epoch 8, batch 28050, loss[loss=0.1926, simple_loss=0.2705, pruned_loss=0.05735, over 21777.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2937, pruned_loss=0.06708, over 4289324.09 frames. ], batch size: 298, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:08:56,376 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:09:08,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1449252.0, ans=0.95 2023-06-26 00:09:47,160 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-26 00:09:57,867 INFO [train.py:996] (2/4) Epoch 8, batch 28100, loss[loss=0.1901, simple_loss=0.2535, pruned_loss=0.06338, over 21285.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2936, pruned_loss=0.06732, over 4281786.56 frames. ], batch size: 159, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:10:05,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1449372.0, ans=0.125 2023-06-26 00:10:27,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1449432.0, ans=0.05 2023-06-26 00:10:29,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1449432.0, ans=0.0 2023-06-26 00:10:35,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1449492.0, ans=0.125 2023-06-26 00:11:27,654 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.094e+02 4.530e+02 6.783e+02 9.812e+02 2.062e+03, threshold=1.357e+03, percent-clipped=16.0 2023-06-26 00:11:40,020 INFO [train.py:996] (2/4) Epoch 8, batch 28150, loss[loss=0.1825, simple_loss=0.2492, pruned_loss=0.05794, over 21550.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2883, pruned_loss=0.06774, over 4275945.48 frames. 
], batch size: 263, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:11:44,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1449672.0, ans=0.0 2023-06-26 00:11:44,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1449672.0, ans=0.0 2023-06-26 00:12:07,149 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.25 vs. limit=15.0 2023-06-26 00:12:52,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1449852.0, ans=0.0 2023-06-26 00:13:01,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1449852.0, ans=0.2 2023-06-26 00:13:17,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1449912.0, ans=0.0 2023-06-26 00:13:18,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1449912.0, ans=0.035 2023-06-26 00:13:26,692 INFO [train.py:996] (2/4) Epoch 8, batch 28200, loss[loss=0.2063, simple_loss=0.2688, pruned_loss=0.07189, over 21383.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2855, pruned_loss=0.0688, over 4270447.93 frames. ], batch size: 211, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:13:44,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1449972.0, ans=0.05 2023-06-26 00:13:45,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1449972.0, ans=0.125 2023-06-26 00:14:05,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1450092.0, ans=10.0 2023-06-26 00:14:21,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1450092.0, ans=0.0 2023-06-26 00:14:36,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1450092.0, ans=0.5 2023-06-26 00:14:36,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1450092.0, ans=0.125 2023-06-26 00:15:02,518 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.442e+02 4.547e+02 5.710e+02 8.432e+02 1.923e+03, threshold=1.142e+03, percent-clipped=7.0 2023-06-26 00:15:03,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1450212.0, ans=0.1 2023-06-26 00:15:14,910 INFO [train.py:996] (2/4) Epoch 8, batch 28250, loss[loss=0.1996, simple_loss=0.2742, pruned_loss=0.06253, over 21893.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2897, pruned_loss=0.07179, over 4273264.37 frames. 
], batch size: 317, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:16:06,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1450392.0, ans=0.0 2023-06-26 00:16:27,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1450452.0, ans=0.2 2023-06-26 00:16:51,403 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.81 vs. limit=22.5 2023-06-26 00:17:04,118 INFO [train.py:996] (2/4) Epoch 8, batch 28300, loss[loss=0.1788, simple_loss=0.2839, pruned_loss=0.03685, over 21215.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2878, pruned_loss=0.06966, over 4266712.96 frames. ], batch size: 548, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:17:17,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1450572.0, ans=0.125 2023-06-26 00:18:07,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1450692.0, ans=0.125 2023-06-26 00:18:35,339 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=22.5 2023-06-26 00:18:39,094 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.826e+02 4.235e+02 6.929e+02 1.082e+03 2.013e+03, threshold=1.386e+03, percent-clipped=23.0 2023-06-26 00:18:56,490 INFO [train.py:996] (2/4) Epoch 8, batch 28350, loss[loss=0.1865, simple_loss=0.2532, pruned_loss=0.05988, over 21281.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2837, pruned_loss=0.06426, over 4264213.00 frames. ], batch size: 159, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:19:48,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1450992.0, ans=0.0 2023-06-26 00:19:57,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1450992.0, ans=0.125 2023-06-26 00:20:30,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1451112.0, ans=0.2 2023-06-26 00:20:43,841 INFO [train.py:996] (2/4) Epoch 8, batch 28400, loss[loss=0.2117, simple_loss=0.2812, pruned_loss=0.07109, over 21450.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2813, pruned_loss=0.06406, over 4264222.13 frames. ], batch size: 211, lr: 3.63e-03, grad_scale: 32.0 2023-06-26 00:21:04,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1451172.0, ans=0.0 2023-06-26 00:21:30,333 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:21:32,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1451232.0, ans=0.0 2023-06-26 00:21:54,662 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.57 vs. 
limit=15.0 2023-06-26 00:21:59,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1451352.0, ans=0.2 2023-06-26 00:22:20,865 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.356e+02 4.435e+02 6.673e+02 8.870e+02 1.776e+03, threshold=1.335e+03, percent-clipped=3.0 2023-06-26 00:22:25,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1451412.0, ans=0.2 2023-06-26 00:22:31,339 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.67 vs. limit=15.0 2023-06-26 00:22:31,523 INFO [train.py:996] (2/4) Epoch 8, batch 28450, loss[loss=0.2207, simple_loss=0.2898, pruned_loss=0.07577, over 20675.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2862, pruned_loss=0.06802, over 4270831.97 frames. ], batch size: 607, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:22:41,343 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.12 vs. limit=10.0 2023-06-26 00:23:24,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1451592.0, ans=0.125 2023-06-26 00:24:30,340 INFO [train.py:996] (2/4) Epoch 8, batch 28500, loss[loss=0.2321, simple_loss=0.3218, pruned_loss=0.0712, over 21811.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2883, pruned_loss=0.07035, over 4277179.17 frames. ], batch size: 118, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:24:50,957 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.08 vs. limit=10.0 2023-06-26 00:25:15,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1451892.0, ans=0.125 2023-06-26 00:25:17,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1451892.0, ans=0.125 2023-06-26 00:25:24,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1451892.0, ans=0.1 2023-06-26 00:26:09,286 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.427e+02 4.818e+02 6.676e+02 8.470e+02 2.134e+03, threshold=1.335e+03, percent-clipped=3.0 2023-06-26 00:26:19,591 INFO [train.py:996] (2/4) Epoch 8, batch 28550, loss[loss=0.3383, simple_loss=0.4169, pruned_loss=0.1298, over 21452.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2965, pruned_loss=0.07346, over 4279673.19 frames. 
], batch size: 507, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:26:40,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1452072.0, ans=0.0 2023-06-26 00:26:45,614 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:27:06,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1452192.0, ans=0.125 2023-06-26 00:27:10,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1452192.0, ans=0.125 2023-06-26 00:28:15,670 INFO [train.py:996] (2/4) Epoch 8, batch 28600, loss[loss=0.2345, simple_loss=0.3089, pruned_loss=0.08008, over 20717.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3039, pruned_loss=0.07597, over 4273979.57 frames. ], batch size: 607, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:28:33,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1452432.0, ans=0.0 2023-06-26 00:28:35,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1452432.0, ans=0.125 2023-06-26 00:28:39,377 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=12.0 2023-06-26 00:28:42,891 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.84 vs. limit=22.5 2023-06-26 00:28:57,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1452492.0, ans=0.125 2023-06-26 00:29:31,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1452552.0, ans=0.125 2023-06-26 00:29:53,323 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.158e+02 4.451e+02 5.957e+02 7.529e+02 1.462e+03, threshold=1.191e+03, percent-clipped=3.0 2023-06-26 00:30:03,859 INFO [train.py:996] (2/4) Epoch 8, batch 28650, loss[loss=0.2065, simple_loss=0.2736, pruned_loss=0.06974, over 21673.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2975, pruned_loss=0.07484, over 4274083.99 frames. ], batch size: 333, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:30:20,817 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=12.0 2023-06-26 00:30:22,375 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=22.5 2023-06-26 00:31:47,839 INFO [train.py:996] (2/4) Epoch 8, batch 28700, loss[loss=0.2392, simple_loss=0.3138, pruned_loss=0.08231, over 21901.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2949, pruned_loss=0.07476, over 4278134.44 frames. 
], batch size: 372, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:31:50,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1452972.0, ans=0.125 2023-06-26 00:32:05,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1453032.0, ans=0.0 2023-06-26 00:32:14,973 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. limit=6.0 2023-06-26 00:32:26,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1453092.0, ans=0.125 2023-06-26 00:33:19,799 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.264e+02 4.611e+02 5.755e+02 7.778e+02 1.501e+03, threshold=1.151e+03, percent-clipped=4.0 2023-06-26 00:33:30,641 INFO [train.py:996] (2/4) Epoch 8, batch 28750, loss[loss=0.2423, simple_loss=0.3195, pruned_loss=0.08259, over 21759.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2954, pruned_loss=0.07517, over 4278711.37 frames. ], batch size: 112, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:34:46,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1453452.0, ans=0.1 2023-06-26 00:34:56,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1453452.0, ans=0.0 2023-06-26 00:34:56,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1453452.0, ans=0.1 2023-06-26 00:35:11,541 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=15.0 2023-06-26 00:35:18,781 INFO [train.py:996] (2/4) Epoch 8, batch 28800, loss[loss=0.2311, simple_loss=0.31, pruned_loss=0.07613, over 20703.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.299, pruned_loss=0.07516, over 4279506.45 frames. ], batch size: 608, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 00:36:14,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1453692.0, ans=0.0 2023-06-26 00:36:14,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1453692.0, ans=0.125 2023-06-26 00:36:55,665 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.079e+02 4.504e+02 5.803e+02 7.798e+02 1.715e+03, threshold=1.161e+03, percent-clipped=9.0 2023-06-26 00:37:06,156 INFO [train.py:996] (2/4) Epoch 8, batch 28850, loss[loss=0.2788, simple_loss=0.3296, pruned_loss=0.1141, over 21546.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3009, pruned_loss=0.07656, over 4276789.25 frames. ], batch size: 471, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 00:37:44,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1453932.0, ans=0.125 2023-06-26 00:39:02,771 INFO [train.py:996] (2/4) Epoch 8, batch 28900, loss[loss=0.2951, simple_loss=0.3627, pruned_loss=0.1138, over 21520.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3037, pruned_loss=0.07835, over 4275202.25 frames. 
], batch size: 471, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 00:39:30,793 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-26 00:39:55,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1454292.0, ans=0.2 2023-06-26 00:40:06,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1454292.0, ans=0.2 2023-06-26 00:40:21,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1454352.0, ans=0.0 2023-06-26 00:40:24,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1454352.0, ans=0.0 2023-06-26 00:40:36,721 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.510e+02 4.525e+02 6.150e+02 8.317e+02 2.231e+03, threshold=1.230e+03, percent-clipped=10.0 2023-06-26 00:40:57,575 INFO [train.py:996] (2/4) Epoch 8, batch 28950, loss[loss=0.2033, simple_loss=0.3008, pruned_loss=0.05288, over 21837.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3081, pruned_loss=0.07873, over 4270982.68 frames. ], batch size: 316, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 00:41:17,174 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.78 vs. limit=10.0 2023-06-26 00:41:34,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1454532.0, ans=0.125 2023-06-26 00:41:57,348 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-26 00:41:58,945 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=22.5 2023-06-26 00:42:00,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1454592.0, ans=0.125 2023-06-26 00:42:52,316 INFO [train.py:996] (2/4) Epoch 8, batch 29000, loss[loss=0.2325, simple_loss=0.3071, pruned_loss=0.079, over 21999.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3122, pruned_loss=0.07846, over 4266953.71 frames. 
], batch size: 317, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:42:58,066 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:43:12,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1454772.0, ans=0.125 2023-06-26 00:43:17,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1454832.0, ans=0.125 2023-06-26 00:43:45,207 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:44:24,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1455012.0, ans=0.2 2023-06-26 00:44:25,403 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.229e+02 4.694e+02 5.564e+02 8.456e+02 2.061e+03, threshold=1.113e+03, percent-clipped=6.0 2023-06-26 00:44:38,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1455072.0, ans=10.0 2023-06-26 00:44:39,536 INFO [train.py:996] (2/4) Epoch 8, batch 29050, loss[loss=0.2131, simple_loss=0.2907, pruned_loss=0.06773, over 21526.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3096, pruned_loss=0.07873, over 4275754.33 frames. ], batch size: 131, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:44:59,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1455072.0, ans=0.09899494936611666 2023-06-26 00:45:16,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1455132.0, ans=0.0 2023-06-26 00:45:46,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1455252.0, ans=0.125 2023-06-26 00:45:46,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1455252.0, ans=0.2 2023-06-26 00:46:27,348 INFO [train.py:996] (2/4) Epoch 8, batch 29100, loss[loss=0.245, simple_loss=0.2843, pruned_loss=0.1029, over 21429.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3003, pruned_loss=0.0764, over 4269745.29 frames. ], batch size: 509, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:47:57,108 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.65 vs. limit=10.0 2023-06-26 00:48:06,921 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.913e+02 4.309e+02 6.273e+02 8.461e+02 1.678e+03, threshold=1.255e+03, percent-clipped=7.0 2023-06-26 00:48:15,343 INFO [train.py:996] (2/4) Epoch 8, batch 29150, loss[loss=0.207, simple_loss=0.2782, pruned_loss=0.06787, over 21257.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2994, pruned_loss=0.07448, over 4271666.75 frames. ], batch size: 159, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:48:25,735 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=12.0 2023-06-26 00:48:59,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1455792.0, ans=0.2 2023-06-26 00:50:08,215 INFO [train.py:996] (2/4) Epoch 8, batch 29200, loss[loss=0.221, simple_loss=0.2787, pruned_loss=0.08158, over 21557.00 frames. 
], tot_loss[loss=0.2195, simple_loss=0.2938, pruned_loss=0.07257, over 4270151.27 frames. ], batch size: 414, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 00:50:23,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1455972.0, ans=0.125 2023-06-26 00:50:35,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1456032.0, ans=6.0 2023-06-26 00:51:08,798 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=22.5 2023-06-26 00:51:42,006 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.239e+02 4.282e+02 5.514e+02 8.024e+02 1.461e+03, threshold=1.103e+03, percent-clipped=3.0 2023-06-26 00:51:56,521 INFO [train.py:996] (2/4) Epoch 8, batch 29250, loss[loss=0.1812, simple_loss=0.2577, pruned_loss=0.05234, over 21773.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2921, pruned_loss=0.07025, over 4278374.18 frames. ], batch size: 118, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 00:52:01,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1456272.0, ans=0.0 2023-06-26 00:52:11,796 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:53:10,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1456452.0, ans=0.0 2023-06-26 00:53:44,028 INFO [train.py:996] (2/4) Epoch 8, batch 29300, loss[loss=0.2513, simple_loss=0.3002, pruned_loss=0.1012, over 21345.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.293, pruned_loss=0.06961, over 4270728.91 frames. ], batch size: 507, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:54:24,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1456692.0, ans=0.0 2023-06-26 00:54:36,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1456692.0, ans=0.125 2023-06-26 00:55:25,851 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.817e+02 4.100e+02 5.558e+02 8.472e+02 2.092e+03, threshold=1.112e+03, percent-clipped=11.0 2023-06-26 00:55:29,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1456812.0, ans=0.125 2023-06-26 00:55:32,116 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.89 vs. limit=12.0 2023-06-26 00:55:32,611 INFO [train.py:996] (2/4) Epoch 8, batch 29350, loss[loss=0.1942, simple_loss=0.2702, pruned_loss=0.05905, over 21248.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2886, pruned_loss=0.06853, over 4272307.93 frames. 
], batch size: 144, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:55:36,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1456872.0, ans=0.0 2023-06-26 00:55:40,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1456872.0, ans=0.2 2023-06-26 00:55:42,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1456872.0, ans=0.2 2023-06-26 00:55:52,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1456932.0, ans=0.125 2023-06-26 00:55:59,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1456932.0, ans=0.2 2023-06-26 00:56:46,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1457052.0, ans=0.125 2023-06-26 00:56:58,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1457052.0, ans=0.125 2023-06-26 00:57:21,106 INFO [train.py:996] (2/4) Epoch 8, batch 29400, loss[loss=0.1825, simple_loss=0.2564, pruned_loss=0.05423, over 21663.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2878, pruned_loss=0.06684, over 4267022.06 frames. ], batch size: 247, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:58:48,458 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=15.0 2023-06-26 00:59:02,200 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.050e+02 4.516e+02 7.158e+02 1.067e+03 2.108e+03, threshold=1.432e+03, percent-clipped=22.0 2023-06-26 00:59:09,206 INFO [train.py:996] (2/4) Epoch 8, batch 29450, loss[loss=0.2522, simple_loss=0.3225, pruned_loss=0.09099, over 21723.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2852, pruned_loss=0.06616, over 4268328.25 frames. ], batch size: 351, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:59:26,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1457472.0, ans=0.2 2023-06-26 00:59:33,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1457532.0, ans=10.0 2023-06-26 00:59:39,198 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0 2023-06-26 01:00:24,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1457652.0, ans=0.125 2023-06-26 01:00:56,335 INFO [train.py:996] (2/4) Epoch 8, batch 29500, loss[loss=0.218, simple_loss=0.2826, pruned_loss=0.07673, over 21340.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.291, pruned_loss=0.06972, over 4270820.50 frames. ], batch size: 176, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 01:00:57,089 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:01:07,523 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.38 vs. 
limit=15.0 2023-06-26 01:01:10,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1457772.0, ans=0.125 2023-06-26 01:01:17,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1457832.0, ans=0.0 2023-06-26 01:01:48,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1457892.0, ans=0.05 2023-06-26 01:02:09,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1457952.0, ans=0.1 2023-06-26 01:02:17,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1457952.0, ans=0.125 2023-06-26 01:02:25,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1458012.0, ans=0.1 2023-06-26 01:02:25,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1458012.0, ans=0.125 2023-06-26 01:02:36,145 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.293e+02 4.544e+02 5.932e+02 7.825e+02 1.489e+03, threshold=1.186e+03, percent-clipped=1.0 2023-06-26 01:02:38,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1458012.0, ans=10.0 2023-06-26 01:02:42,878 INFO [train.py:996] (2/4) Epoch 8, batch 29550, loss[loss=0.2604, simple_loss=0.3331, pruned_loss=0.09389, over 21874.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2911, pruned_loss=0.07178, over 4280071.01 frames. ], batch size: 107, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 01:02:56,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1458072.0, ans=0.125 2023-06-26 01:03:02,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1458072.0, ans=0.1 2023-06-26 01:03:03,590 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.68 vs. limit=15.0 2023-06-26 01:04:08,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=1458252.0, ans=0.1 2023-06-26 01:04:17,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1458312.0, ans=0.0 2023-06-26 01:04:31,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1458312.0, ans=0.125 2023-06-26 01:04:33,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1458312.0, ans=0.025 2023-06-26 01:04:40,148 INFO [train.py:996] (2/4) Epoch 8, batch 29600, loss[loss=0.2297, simple_loss=0.3242, pruned_loss=0.06763, over 21665.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2964, pruned_loss=0.07333, over 4277292.89 frames. 
], batch size: 263, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 01:05:16,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1458432.0, ans=0.125 2023-06-26 01:05:28,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1458492.0, ans=0.125 2023-06-26 01:05:39,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1458492.0, ans=0.125 2023-06-26 01:06:20,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1458612.0, ans=0.1 2023-06-26 01:06:21,144 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.719e+02 4.529e+02 7.554e+02 1.096e+03 2.697e+03, threshold=1.511e+03, percent-clipped=19.0 2023-06-26 01:06:27,951 INFO [train.py:996] (2/4) Epoch 8, batch 29650, loss[loss=0.1918, simple_loss=0.2631, pruned_loss=0.06022, over 21117.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.294, pruned_loss=0.07046, over 4279813.69 frames. ], batch size: 608, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 01:08:05,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1458912.0, ans=0.125 2023-06-26 01:08:17,139 INFO [train.py:996] (2/4) Epoch 8, batch 29700, loss[loss=0.2171, simple_loss=0.3319, pruned_loss=0.05114, over 19809.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2939, pruned_loss=0.07021, over 4279969.15 frames. ], batch size: 702, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 01:08:17,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1458972.0, ans=0.0 2023-06-26 01:08:50,544 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:08:50,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1459032.0, ans=0.1 2023-06-26 01:08:53,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1459032.0, ans=0.125 2023-06-26 01:09:25,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1459152.0, ans=0.2 2023-06-26 01:09:34,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1459152.0, ans=0.1 2023-06-26 01:09:57,648 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.231e+02 4.516e+02 5.860e+02 9.248e+02 1.775e+03, threshold=1.172e+03, percent-clipped=6.0 2023-06-26 01:10:04,574 INFO [train.py:996] (2/4) Epoch 8, batch 29750, loss[loss=0.2214, simple_loss=0.2867, pruned_loss=0.07804, over 21437.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2996, pruned_loss=0.07011, over 4277426.86 frames. 
], batch size: 144, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 01:10:19,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1459272.0, ans=0.2 2023-06-26 01:10:42,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1459332.0, ans=0.125 2023-06-26 01:11:26,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1459452.0, ans=0.1 2023-06-26 01:11:51,244 INFO [train.py:996] (2/4) Epoch 8, batch 29800, loss[loss=0.2394, simple_loss=0.3084, pruned_loss=0.08516, over 21366.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3015, pruned_loss=0.07137, over 4289576.49 frames. ], batch size: 159, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 01:12:15,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1459572.0, ans=0.1 2023-06-26 01:12:29,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1459632.0, ans=0.1 2023-06-26 01:13:09,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1459752.0, ans=0.1 2023-06-26 01:13:13,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=1459752.0, ans=12.0 2023-06-26 01:13:32,347 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.767e+02 3.928e+02 4.572e+02 6.290e+02 1.025e+03, threshold=9.144e+02, percent-clipped=0.0 2023-06-26 01:13:37,471 INFO [train.py:996] (2/4) Epoch 8, batch 29850, loss[loss=0.1878, simple_loss=0.2609, pruned_loss=0.05739, over 21830.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2969, pruned_loss=0.06879, over 4292179.34 frames. ], batch size: 247, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 01:13:40,120 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.18 vs. limit=15.0 2023-06-26 01:14:37,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1459992.0, ans=0.125 2023-06-26 01:15:01,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1460112.0, ans=0.0 2023-06-26 01:15:04,294 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.39 vs. limit=22.5 2023-06-26 01:15:20,101 INFO [train.py:996] (2/4) Epoch 8, batch 29900, loss[loss=0.2373, simple_loss=0.3077, pruned_loss=0.08349, over 21593.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2963, pruned_loss=0.07024, over 4293630.26 frames. ], batch size: 263, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 01:15:23,245 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=22.5 2023-06-26 01:15:37,261 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.27 vs. 
limit=22.5 2023-06-26 01:15:41,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1460172.0, ans=0.1 2023-06-26 01:16:58,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1460412.0, ans=0.125 2023-06-26 01:17:00,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1460412.0, ans=0.2 2023-06-26 01:17:10,431 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.335e+02 4.671e+02 6.480e+02 9.712e+02 1.710e+03, threshold=1.296e+03, percent-clipped=28.0 2023-06-26 01:17:11,860 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.97 vs. limit=6.0 2023-06-26 01:17:15,551 INFO [train.py:996] (2/4) Epoch 8, batch 29950, loss[loss=0.2473, simple_loss=0.3225, pruned_loss=0.08604, over 21334.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2999, pruned_loss=0.07425, over 4287481.98 frames. ], batch size: 143, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:18:53,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1460712.0, ans=0.125 2023-06-26 01:19:00,383 INFO [train.py:996] (2/4) Epoch 8, batch 30000, loss[loss=0.2067, simple_loss=0.3069, pruned_loss=0.05322, over 21617.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3016, pruned_loss=0.07459, over 4283829.99 frames. ], batch size: 414, lr: 3.61e-03, grad_scale: 32.0 2023-06-26 01:19:00,384 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-26 01:19:11,051 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.5093, 3.2087, 3.3897, 3.6204, 3.0360, 2.9552, 3.6340, 3.6495], device='cuda:2') 2023-06-26 01:19:18,802 INFO [train.py:1028] (2/4) Epoch 8, validation: loss=0.2464, simple_loss=0.3452, pruned_loss=0.07378, over 1796401.00 frames. 2023-06-26 01:19:18,803 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-26 01:21:14,232 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.23 vs. limit=15.0 2023-06-26 01:21:14,675 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.858e+02 4.174e+02 5.657e+02 7.922e+02 1.669e+03, threshold=1.131e+03, percent-clipped=1.0 2023-06-26 01:21:20,158 INFO [train.py:996] (2/4) Epoch 8, batch 30050, loss[loss=0.3306, simple_loss=0.4221, pruned_loss=0.1196, over 21388.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3064, pruned_loss=0.07191, over 4279385.15 frames. ], batch size: 507, lr: 3.61e-03, grad_scale: 32.0 2023-06-26 01:21:28,253 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.67 vs. 
limit=10.0 2023-06-26 01:22:13,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1461192.0, ans=0.125 2023-06-26 01:22:22,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1461192.0, ans=0.125 2023-06-26 01:22:25,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1461192.0, ans=0.0 2023-06-26 01:22:27,342 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:22:30,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1461252.0, ans=0.2 2023-06-26 01:22:39,590 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.89 vs. limit=15.0 2023-06-26 01:22:50,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1461312.0, ans=0.125 2023-06-26 01:23:13,776 INFO [train.py:996] (2/4) Epoch 8, batch 30100, loss[loss=0.1898, simple_loss=0.2617, pruned_loss=0.059, over 21489.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3039, pruned_loss=0.07103, over 4280663.79 frames. ], batch size: 212, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:23:52,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1461492.0, ans=0.125 2023-06-26 01:24:53,945 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.090e+02 4.517e+02 6.270e+02 9.720e+02 3.054e+03, threshold=1.254e+03, percent-clipped=16.0 2023-06-26 01:24:57,515 INFO [train.py:996] (2/4) Epoch 8, batch 30150, loss[loss=0.2634, simple_loss=0.3266, pruned_loss=0.1, over 21789.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3, pruned_loss=0.07236, over 4277579.74 frames. ], batch size: 441, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:25:05,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1461672.0, ans=0.125 2023-06-26 01:25:39,434 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-26 01:26:51,049 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.08 vs. limit=12.0 2023-06-26 01:26:53,760 INFO [train.py:996] (2/4) Epoch 8, batch 30200, loss[loss=0.2972, simple_loss=0.3739, pruned_loss=0.1102, over 21362.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3036, pruned_loss=0.07177, over 4280959.41 frames. ], batch size: 507, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:26:56,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1461972.0, ans=0.125 2023-06-26 01:27:09,689 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.91 vs. 
limit=15.0 2023-06-26 01:27:37,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1462032.0, ans=0.125 2023-06-26 01:27:57,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1462092.0, ans=0.2 2023-06-26 01:28:45,643 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.354e+02 5.048e+02 7.227e+02 1.023e+03 2.150e+03, threshold=1.445e+03, percent-clipped=15.0 2023-06-26 01:28:48,922 INFO [train.py:996] (2/4) Epoch 8, batch 30250, loss[loss=0.3186, simple_loss=0.4035, pruned_loss=0.1169, over 21509.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3111, pruned_loss=0.07406, over 4275399.46 frames. ], batch size: 471, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:29:06,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1462272.0, ans=0.1 2023-06-26 01:29:23,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1462332.0, ans=0.0 2023-06-26 01:30:14,064 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.72 vs. limit=6.0 2023-06-26 01:30:36,872 INFO [train.py:996] (2/4) Epoch 8, batch 30300, loss[loss=0.2096, simple_loss=0.2704, pruned_loss=0.07439, over 21397.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3088, pruned_loss=0.07366, over 4277885.80 frames. ], batch size: 389, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:30:37,693 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:30:41,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1462572.0, ans=0.1 2023-06-26 01:31:39,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1462692.0, ans=0.125 2023-06-26 01:32:08,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1462752.0, ans=0.125 2023-06-26 01:32:13,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1462812.0, ans=0.1 2023-06-26 01:32:31,190 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.196e+02 5.174e+02 6.761e+02 1.021e+03 2.632e+03, threshold=1.352e+03, percent-clipped=10.0 2023-06-26 01:32:34,777 INFO [train.py:996] (2/4) Epoch 8, batch 30350, loss[loss=0.2315, simple_loss=0.3199, pruned_loss=0.07154, over 21731.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3078, pruned_loss=0.07504, over 4270083.25 frames. ], batch size: 298, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:33:06,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1462932.0, ans=0.2 2023-06-26 01:33:18,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1462992.0, ans=0.2 2023-06-26 01:33:22,069 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. 
limit=15.0 2023-06-26 01:33:23,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1462992.0, ans=0.0 2023-06-26 01:33:56,342 INFO [train.py:996] (2/4) Epoch 8, batch 30400, loss[loss=0.201, simple_loss=0.2524, pruned_loss=0.07485, over 20268.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3005, pruned_loss=0.07333, over 4257763.84 frames. ], batch size: 703, lr: 3.61e-03, grad_scale: 32.0 2023-06-26 01:34:05,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1463172.0, ans=0.0 2023-06-26 01:34:20,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1463232.0, ans=0.0 2023-06-26 01:34:24,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1463232.0, ans=0.125 2023-06-26 01:34:32,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1463232.0, ans=0.0 2023-06-26 01:35:04,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1463352.0, ans=0.125 2023-06-26 01:35:06,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1463412.0, ans=0.125 2023-06-26 01:35:24,312 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.064e+02 6.383e+02 1.075e+03 1.632e+03 7.193e+03, threshold=2.149e+03, percent-clipped=36.0 2023-06-26 01:35:25,764 INFO [train.py:996] (2/4) Epoch 8, batch 30450, loss[loss=0.265, simple_loss=0.3834, pruned_loss=0.07325, over 19830.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3014, pruned_loss=0.07302, over 4198498.28 frames. ], batch size: 702, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:35:29,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1463472.0, ans=0.1 2023-06-26 01:35:45,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1463532.0, ans=0.125 2023-06-26 01:36:10,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1463592.0, ans=0.2 2023-06-26 01:36:29,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1463652.0, ans=0.1 2023-06-26 01:36:33,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1463712.0, ans=0.0 2023-06-26 01:36:36,365 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:38:50,980 INFO [train.py:996] (2/4) Epoch 9, batch 0, loss[loss=0.2025, simple_loss=0.2747, pruned_loss=0.06514, over 21623.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2747, pruned_loss=0.06514, over 21623.00 frames. ], batch size: 298, lr: 3.39e-03, grad_scale: 32.0 2023-06-26 01:38:50,981 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-26 01:39:14,236 INFO [train.py:1028] (2/4) Epoch 9, validation: loss=0.2395, simple_loss=0.3459, pruned_loss=0.06656, over 1796401.00 frames. 
2023-06-26 01:39:14,237 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-26 01:39:27,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1463742.0, ans=0.125 2023-06-26 01:40:30,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1463922.0, ans=0.1 2023-06-26 01:40:59,331 INFO [train.py:996] (2/4) Epoch 9, batch 50, loss[loss=0.242, simple_loss=0.3216, pruned_loss=0.08121, over 21760.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3112, pruned_loss=0.07473, over 972295.33 frames. ], batch size: 247, lr: 3.39e-03, grad_scale: 16.0 2023-06-26 01:41:10,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1464042.0, ans=0.0 2023-06-26 01:41:13,454 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.197e+02 4.855e+02 1.072e+03 2.293e+03 5.497e+03, threshold=2.144e+03, percent-clipped=28.0 2023-06-26 01:41:34,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1464102.0, ans=0.125 2023-06-26 01:42:03,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1464162.0, ans=0.0 2023-06-26 01:42:09,079 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=22.5 2023-06-26 01:42:40,953 INFO [train.py:996] (2/4) Epoch 9, batch 100, loss[loss=0.2656, simple_loss=0.3571, pruned_loss=0.08704, over 21638.00 frames. ], tot_loss[loss=0.238, simple_loss=0.324, pruned_loss=0.07604, over 1693859.09 frames. ], batch size: 414, lr: 3.39e-03, grad_scale: 16.0 2023-06-26 01:43:48,374 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.88 vs. limit=12.0 2023-06-26 01:44:01,344 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:44:10,908 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0 2023-06-26 01:44:18,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1464582.0, ans=0.0 2023-06-26 01:44:26,128 INFO [train.py:996] (2/4) Epoch 9, batch 150, loss[loss=0.2182, simple_loss=0.2961, pruned_loss=0.07018, over 21908.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3228, pruned_loss=0.07492, over 2260138.32 frames. 
], batch size: 316, lr: 3.39e-03, grad_scale: 16.0 2023-06-26 01:44:40,675 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.204e+02 4.415e+02 5.834e+02 7.944e+02 1.480e+03, threshold=1.167e+03, percent-clipped=0.0 2023-06-26 01:44:57,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1464702.0, ans=0.04949747468305833 2023-06-26 01:45:38,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1464822.0, ans=0.0 2023-06-26 01:45:46,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1464822.0, ans=0.05 2023-06-26 01:46:13,218 INFO [train.py:996] (2/4) Epoch 9, batch 200, loss[loss=0.2342, simple_loss=0.2966, pruned_loss=0.08584, over 21696.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.318, pruned_loss=0.07389, over 2688724.18 frames. ], batch size: 230, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:46:40,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1465002.0, ans=0.1 2023-06-26 01:46:58,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1465002.0, ans=0.5 2023-06-26 01:47:37,558 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.90 vs. limit=10.0 2023-06-26 01:48:00,436 INFO [train.py:996] (2/4) Epoch 9, batch 250, loss[loss=0.2121, simple_loss=0.2782, pruned_loss=0.07303, over 21502.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3149, pruned_loss=0.07439, over 3040868.89 frames. ], batch size: 194, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:48:08,788 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.143e+02 4.378e+02 6.069e+02 8.721e+02 1.562e+03, threshold=1.214e+03, percent-clipped=10.0 2023-06-26 01:48:30,319 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.07 vs. limit=15.0 2023-06-26 01:48:36,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1465302.0, ans=0.035 2023-06-26 01:49:11,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1465362.0, ans=0.0 2023-06-26 01:49:16,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1465422.0, ans=0.125 2023-06-26 01:49:45,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1465482.0, ans=0.07 2023-06-26 01:49:50,366 INFO [train.py:996] (2/4) Epoch 9, batch 300, loss[loss=0.2611, simple_loss=0.3786, pruned_loss=0.07178, over 19785.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3099, pruned_loss=0.07362, over 3308844.45 frames. ], batch size: 702, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:50:30,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1465602.0, ans=0.0 2023-06-26 01:50:30,807 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.33 vs. 
limit=6.0 2023-06-26 01:50:37,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1465602.0, ans=0.125 2023-06-26 01:51:41,303 INFO [train.py:996] (2/4) Epoch 9, batch 350, loss[loss=0.1912, simple_loss=0.2587, pruned_loss=0.06185, over 21213.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3034, pruned_loss=0.07156, over 3515306.48 frames. ], batch size: 144, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:51:47,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1465842.0, ans=0.125 2023-06-26 01:51:50,487 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.975e+02 4.637e+02 6.282e+02 9.202e+02 1.945e+03, threshold=1.256e+03, percent-clipped=12.0 2023-06-26 01:52:30,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1465902.0, ans=0.125 2023-06-26 01:53:30,988 INFO [train.py:996] (2/4) Epoch 9, batch 400, loss[loss=0.1778, simple_loss=0.2427, pruned_loss=0.05644, over 21206.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2959, pruned_loss=0.07039, over 3676841.60 frames. ], batch size: 159, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:53:33,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1466142.0, ans=0.125 2023-06-26 01:54:35,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1466262.0, ans=10.0 2023-06-26 01:54:49,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1466322.0, ans=0.1 2023-06-26 01:54:54,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1466322.0, ans=0.5 2023-06-26 01:55:19,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1466442.0, ans=0.1 2023-06-26 01:55:19,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1466442.0, ans=0.0 2023-06-26 01:55:20,924 INFO [train.py:996] (2/4) Epoch 9, batch 450, loss[loss=0.2299, simple_loss=0.3102, pruned_loss=0.07481, over 21981.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2942, pruned_loss=0.069, over 3812680.16 frames. ], batch size: 113, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:55:41,295 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.313e+02 4.889e+02 7.953e+02 1.170e+03 2.853e+03, threshold=1.591e+03, percent-clipped=21.0 2023-06-26 01:55:43,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1466442.0, ans=0.2 2023-06-26 01:55:53,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1466502.0, ans=0.1 2023-06-26 01:56:33,369 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.78 vs. 
limit=10.0 2023-06-26 01:56:45,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1466622.0, ans=0.125 2023-06-26 01:56:56,576 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.68 vs. limit=10.0 2023-06-26 01:57:13,998 INFO [train.py:996] (2/4) Epoch 9, batch 500, loss[loss=0.1821, simple_loss=0.2779, pruned_loss=0.04313, over 21691.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2963, pruned_loss=0.06848, over 3921788.93 frames. ], batch size: 298, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:57:20,966 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.23 vs. limit=15.0 2023-06-26 01:57:22,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1466742.0, ans=0.0 2023-06-26 01:57:43,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1466802.0, ans=0.2 2023-06-26 01:58:16,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1466862.0, ans=0.2 2023-06-26 01:58:40,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1466982.0, ans=0.0 2023-06-26 01:59:08,347 INFO [train.py:996] (2/4) Epoch 9, batch 550, loss[loss=0.2693, simple_loss=0.3585, pruned_loss=0.09, over 21761.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.3017, pruned_loss=0.06837, over 3999751.32 frames. ], batch size: 351, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:59:25,223 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.991e+02 4.595e+02 7.824e+02 1.104e+03 2.417e+03, threshold=1.565e+03, percent-clipped=11.0 2023-06-26 01:59:27,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1467042.0, ans=0.125 2023-06-26 01:59:33,480 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=22.5 2023-06-26 02:00:20,294 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0 2023-06-26 02:00:23,590 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=22.5 2023-06-26 02:01:03,298 INFO [train.py:996] (2/4) Epoch 9, batch 600, loss[loss=0.2707, simple_loss=0.3217, pruned_loss=0.1098, over 21813.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.3026, pruned_loss=0.06907, over 4070425.75 frames. ], batch size: 508, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:01:10,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1467342.0, ans=0.025 2023-06-26 02:01:28,952 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=15.0 2023-06-26 02:01:34,269 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.98 vs. 
limit=12.0 2023-06-26 02:01:37,722 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.46 vs. limit=10.0 2023-06-26 02:01:38,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1467402.0, ans=0.0 2023-06-26 02:01:40,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1467402.0, ans=0.125 2023-06-26 02:01:49,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1467462.0, ans=0.1 2023-06-26 02:02:40,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1467642.0, ans=0.125 2023-06-26 02:02:47,023 INFO [train.py:996] (2/4) Epoch 9, batch 650, loss[loss=0.1976, simple_loss=0.2486, pruned_loss=0.07327, over 20065.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3035, pruned_loss=0.06958, over 4116370.79 frames. ], batch size: 704, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:03:03,590 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.173e+02 5.371e+02 7.433e+02 1.361e+03 3.228e+03, threshold=1.487e+03, percent-clipped=18.0 2023-06-26 02:03:12,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1467702.0, ans=0.1 2023-06-26 02:03:39,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1467762.0, ans=0.125 2023-06-26 02:03:42,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1467762.0, ans=0.125 2023-06-26 02:03:45,455 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.03 vs. limit=15.0 2023-06-26 02:03:46,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1467762.0, ans=0.0 2023-06-26 02:03:58,765 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=22.5 2023-06-26 02:04:15,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1467882.0, ans=0.125 2023-06-26 02:04:35,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1467882.0, ans=0.125 2023-06-26 02:04:44,117 INFO [train.py:996] (2/4) Epoch 9, batch 700, loss[loss=0.2325, simple_loss=0.3164, pruned_loss=0.07437, over 21499.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.3, pruned_loss=0.06863, over 4150573.45 frames. ], batch size: 211, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:05:15,176 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.93 vs. 
limit=15.0 2023-06-26 02:05:45,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1468122.0, ans=0.2 2023-06-26 02:06:00,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1468182.0, ans=0.0 2023-06-26 02:06:31,667 INFO [train.py:996] (2/4) Epoch 9, batch 750, loss[loss=0.2064, simple_loss=0.2786, pruned_loss=0.06715, over 21840.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2987, pruned_loss=0.06902, over 4185261.30 frames. ], batch size: 118, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:06:42,117 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.730e+02 4.754e+02 6.417e+02 9.585e+02 1.882e+03, threshold=1.283e+03, percent-clipped=6.0 2023-06-26 02:07:10,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1468362.0, ans=0.0 2023-06-26 02:07:52,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1468482.0, ans=0.2 2023-06-26 02:08:10,175 INFO [train.py:996] (2/4) Epoch 9, batch 800, loss[loss=0.2073, simple_loss=0.2715, pruned_loss=0.07153, over 21464.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2954, pruned_loss=0.06958, over 4206289.36 frames. ], batch size: 195, lr: 3.38e-03, grad_scale: 32.0 2023-06-26 02:08:11,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1468542.0, ans=0.125 2023-06-26 02:09:24,672 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=15.0 2023-06-26 02:10:10,650 INFO [train.py:996] (2/4) Epoch 9, batch 850, loss[loss=0.2601, simple_loss=0.3173, pruned_loss=0.1014, over 21605.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2948, pruned_loss=0.06993, over 4222852.01 frames. ], batch size: 471, lr: 3.38e-03, grad_scale: 32.0 2023-06-26 02:10:20,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1468842.0, ans=0.0 2023-06-26 02:10:26,284 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.492e+02 5.225e+02 7.900e+02 1.161e+03 2.208e+03, threshold=1.580e+03, percent-clipped=19.0 2023-06-26 02:10:27,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1468842.0, ans=0.125 2023-06-26 02:10:36,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1468902.0, ans=0.2 2023-06-26 02:10:52,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1468962.0, ans=0.125 2023-06-26 02:10:52,785 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.32 vs. limit=22.5 2023-06-26 02:11:13,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1469022.0, ans=0.0 2023-06-26 02:11:59,348 INFO [train.py:996] (2/4) Epoch 9, batch 900, loss[loss=0.1851, simple_loss=0.2501, pruned_loss=0.06007, over 21304.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2931, pruned_loss=0.06943, over 4235521.07 frames. 
], batch size: 160, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:12:55,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1469322.0, ans=0.1 2023-06-26 02:13:00,378 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0 2023-06-26 02:13:01,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1469322.0, ans=0.0 2023-06-26 02:13:46,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1469382.0, ans=0.125 2023-06-26 02:13:48,842 INFO [train.py:996] (2/4) Epoch 9, batch 950, loss[loss=0.2129, simple_loss=0.2707, pruned_loss=0.07753, over 21328.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2925, pruned_loss=0.0691, over 4248150.57 frames. ], batch size: 143, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:13:55,248 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.97 vs. limit=15.0 2023-06-26 02:14:01,424 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.026e+02 4.404e+02 7.084e+02 1.100e+03 2.197e+03, threshold=1.417e+03, percent-clipped=5.0 2023-06-26 02:14:24,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1469562.0, ans=0.0 2023-06-26 02:14:44,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1469622.0, ans=0.125 2023-06-26 02:14:59,829 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0 2023-06-26 02:15:23,954 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.01 vs. limit=6.0 2023-06-26 02:15:36,817 INFO [train.py:996] (2/4) Epoch 9, batch 1000, loss[loss=0.2172, simple_loss=0.2873, pruned_loss=0.07356, over 21377.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2925, pruned_loss=0.06927, over 4258135.22 frames. ], batch size: 159, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:15:44,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1469742.0, ans=0.0 2023-06-26 02:16:02,913 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=15.0 2023-06-26 02:16:23,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1469862.0, ans=0.125 2023-06-26 02:16:32,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1469922.0, ans=0.125 2023-06-26 02:16:45,797 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.71 vs. limit=15.0 2023-06-26 02:17:27,475 INFO [train.py:996] (2/4) Epoch 9, batch 1050, loss[loss=0.2042, simple_loss=0.2779, pruned_loss=0.06531, over 21642.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2926, pruned_loss=0.06922, over 4264343.73 frames. 
], batch size: 263, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:17:39,393 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.206e+02 4.347e+02 6.082e+02 9.446e+02 2.534e+03, threshold=1.216e+03, percent-clipped=8.0 2023-06-26 02:18:12,898 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=22.5 2023-06-26 02:18:28,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1470222.0, ans=0.1 2023-06-26 02:19:12,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1470282.0, ans=0.0 2023-06-26 02:19:18,790 INFO [train.py:996] (2/4) Epoch 9, batch 1100, loss[loss=0.2301, simple_loss=0.3205, pruned_loss=0.06984, over 21693.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2935, pruned_loss=0.06874, over 4268376.58 frames. ], batch size: 351, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:19:21,798 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.10 vs. limit=10.0 2023-06-26 02:19:34,191 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.10 vs. limit=10.0 2023-06-26 02:19:40,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1470402.0, ans=0.125 2023-06-26 02:21:06,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1470582.0, ans=0.0 2023-06-26 02:21:09,480 INFO [train.py:996] (2/4) Epoch 9, batch 1150, loss[loss=0.1666, simple_loss=0.2508, pruned_loss=0.04115, over 21416.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2939, pruned_loss=0.06905, over 4276534.78 frames. ], batch size: 211, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:21:22,271 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.381e+02 4.817e+02 6.167e+02 1.033e+03 2.052e+03, threshold=1.233e+03, percent-clipped=13.0 2023-06-26 02:21:38,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1470702.0, ans=0.125 2023-06-26 02:21:40,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1470702.0, ans=0.2 2023-06-26 02:22:19,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1470762.0, ans=0.125 2023-06-26 02:23:00,184 INFO [train.py:996] (2/4) Epoch 9, batch 1200, loss[loss=0.2014, simple_loss=0.2805, pruned_loss=0.06116, over 21583.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2957, pruned_loss=0.07027, over 4279142.77 frames. ], batch size: 230, lr: 3.38e-03, grad_scale: 32.0 2023-06-26 02:23:14,010 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=22.5 2023-06-26 02:23:15,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1470942.0, ans=0.125 2023-06-26 02:24:32,105 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.74 vs. 
limit=15.0 2023-06-26 02:24:46,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1471182.0, ans=0.125 2023-06-26 02:24:52,847 INFO [train.py:996] (2/4) Epoch 9, batch 1250, loss[loss=0.1936, simple_loss=0.2739, pruned_loss=0.05661, over 21260.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2966, pruned_loss=0.07048, over 4284882.52 frames. ], batch size: 159, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:24:55,809 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=22.5 2023-06-26 02:25:06,492 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.265e+02 4.578e+02 6.578e+02 9.426e+02 2.383e+03, threshold=1.316e+03, percent-clipped=14.0 2023-06-26 02:26:43,210 INFO [train.py:996] (2/4) Epoch 9, batch 1300, loss[loss=0.2342, simple_loss=0.3209, pruned_loss=0.07372, over 21833.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2975, pruned_loss=0.07052, over 4286838.21 frames. ], batch size: 316, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:26:45,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1471542.0, ans=0.125 2023-06-26 02:26:47,997 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.18 vs. limit=10.0 2023-06-26 02:26:54,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1471542.0, ans=0.5 2023-06-26 02:27:04,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1471602.0, ans=0.0 2023-06-26 02:27:24,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1471662.0, ans=0.125 2023-06-26 02:28:16,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1471782.0, ans=0.125 2023-06-26 02:28:32,868 INFO [train.py:996] (2/4) Epoch 9, batch 1350, loss[loss=0.2197, simple_loss=0.2963, pruned_loss=0.07154, over 21882.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2975, pruned_loss=0.07058, over 4283548.19 frames. ], batch size: 391, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:28:46,552 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.596e+02 4.887e+02 7.409e+02 1.206e+03 1.964e+03, threshold=1.482e+03, percent-clipped=19.0 2023-06-26 02:28:55,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1471902.0, ans=0.1 2023-06-26 02:29:36,902 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.96 vs. limit=6.0 2023-06-26 02:30:22,915 INFO [train.py:996] (2/4) Epoch 9, batch 1400, loss[loss=0.1882, simple_loss=0.278, pruned_loss=0.0492, over 21506.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2966, pruned_loss=0.07091, over 4291274.25 frames. ], batch size: 211, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:31:51,975 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.39 vs. 
limit=15.0 2023-06-26 02:32:09,051 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=12.0 2023-06-26 02:32:12,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1472442.0, ans=0.05 2023-06-26 02:32:13,555 INFO [train.py:996] (2/4) Epoch 9, batch 1450, loss[loss=0.2446, simple_loss=0.3214, pruned_loss=0.08386, over 16207.00 frames. ], tot_loss[loss=0.22, simple_loss=0.297, pruned_loss=0.07145, over 4292248.78 frames. ], batch size: 61, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:32:27,125 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.492e+02 5.469e+02 8.336e+02 1.169e+03 2.052e+03, threshold=1.667e+03, percent-clipped=11.0 2023-06-26 02:32:29,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1472502.0, ans=0.2 2023-06-26 02:32:45,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1472502.0, ans=0.2 2023-06-26 02:33:02,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1472562.0, ans=0.1 2023-06-26 02:33:30,275 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.10 vs. limit=22.5 2023-06-26 02:33:31,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1472622.0, ans=0.0 2023-06-26 02:33:49,566 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.18 vs. limit=15.0 2023-06-26 02:33:51,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1472682.0, ans=0.125 2023-06-26 02:33:51,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1472682.0, ans=0.125 2023-06-26 02:33:57,841 INFO [train.py:996] (2/4) Epoch 9, batch 1500, loss[loss=0.2158, simple_loss=0.2844, pruned_loss=0.07363, over 21817.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2969, pruned_loss=0.07216, over 4295929.67 frames. ], batch size: 247, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:34:11,663 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.48 vs. limit=15.0 2023-06-26 02:34:58,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1472862.0, ans=0.0 2023-06-26 02:35:39,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1472982.0, ans=0.125 2023-06-26 02:35:44,375 INFO [train.py:996] (2/4) Epoch 9, batch 1550, loss[loss=0.1905, simple_loss=0.2789, pruned_loss=0.0511, over 21662.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2949, pruned_loss=0.07095, over 4291458.30 frames. 
], batch size: 247, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:35:50,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1473042.0, ans=0.125 2023-06-26 02:35:53,210 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-26 02:35:58,929 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.109e+02 4.360e+02 5.874e+02 7.765e+02 1.799e+03, threshold=1.175e+03, percent-clipped=2.0 2023-06-26 02:36:18,313 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.29 vs. limit=15.0 2023-06-26 02:36:57,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1473162.0, ans=0.125 2023-06-26 02:37:35,456 INFO [train.py:996] (2/4) Epoch 9, batch 1600, loss[loss=0.2529, simple_loss=0.3392, pruned_loss=0.08332, over 21815.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2931, pruned_loss=0.07048, over 4286991.72 frames. ], batch size: 371, lr: 3.38e-03, grad_scale: 32.0 2023-06-26 02:37:45,734 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.02 vs. limit=10.0 2023-06-26 02:39:04,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1473582.0, ans=0.125 2023-06-26 02:39:17,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1473582.0, ans=0.0 2023-06-26 02:39:22,843 INFO [train.py:996] (2/4) Epoch 9, batch 1650, loss[loss=0.2666, simple_loss=0.3311, pruned_loss=0.101, over 21749.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2928, pruned_loss=0.07058, over 4288633.53 frames. ], batch size: 441, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:39:56,130 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.260e+02 4.603e+02 6.235e+02 9.034e+02 1.719e+03, threshold=1.247e+03, percent-clipped=11.0 2023-06-26 02:40:20,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1473762.0, ans=0.125 2023-06-26 02:40:28,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1473762.0, ans=15.0 2023-06-26 02:40:57,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1473882.0, ans=0.0 2023-06-26 02:41:11,347 INFO [train.py:996] (2/4) Epoch 9, batch 1700, loss[loss=0.2423, simple_loss=0.3049, pruned_loss=0.08984, over 21804.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.297, pruned_loss=0.07234, over 4289194.93 frames. ], batch size: 441, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:41:57,017 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.82 vs. limit=15.0 2023-06-26 02:42:04,248 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.22 vs. 
limit=5.0 2023-06-26 02:42:47,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1474182.0, ans=0.1 2023-06-26 02:43:04,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1474182.0, ans=0.2 2023-06-26 02:43:10,701 INFO [train.py:996] (2/4) Epoch 9, batch 1750, loss[loss=0.2066, simple_loss=0.2773, pruned_loss=0.06798, over 21697.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2966, pruned_loss=0.07101, over 4280931.39 frames. ], batch size: 263, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:43:26,591 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.127e+02 4.603e+02 7.165e+02 1.089e+03 2.171e+03, threshold=1.433e+03, percent-clipped=16.0 2023-06-26 02:43:51,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1474362.0, ans=0.125 2023-06-26 02:44:59,058 INFO [train.py:996] (2/4) Epoch 9, batch 1800, loss[loss=0.2103, simple_loss=0.3139, pruned_loss=0.05342, over 21687.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2957, pruned_loss=0.06918, over 4279784.80 frames. ], batch size: 298, lr: 3.37e-03, grad_scale: 8.0 2023-06-26 02:45:08,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1474542.0, ans=0.125 2023-06-26 02:46:08,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1474722.0, ans=0.125 2023-06-26 02:46:17,736 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.24 vs. limit=15.0 2023-06-26 02:46:49,584 INFO [train.py:996] (2/4) Epoch 9, batch 1850, loss[loss=0.2172, simple_loss=0.2952, pruned_loss=0.06956, over 20696.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2977, pruned_loss=0.06798, over 4274778.52 frames. ], batch size: 607, lr: 3.37e-03, grad_scale: 8.0 2023-06-26 02:46:51,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1474842.0, ans=0.125 2023-06-26 02:47:02,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1474842.0, ans=0.0 2023-06-26 02:47:07,114 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.277e+02 4.370e+02 7.147e+02 9.387e+02 1.947e+03, threshold=1.429e+03, percent-clipped=4.0 2023-06-26 02:47:18,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1474902.0, ans=0.125 2023-06-26 02:47:21,193 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-26 02:47:41,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1475022.0, ans=0.125 2023-06-26 02:47:41,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1475022.0, ans=0.035 2023-06-26 02:48:08,743 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.07 vs. 
limit=15.0 2023-06-26 02:48:11,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1475022.0, ans=0.125 2023-06-26 02:48:15,755 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=15.0 2023-06-26 02:48:21,198 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.75 vs. limit=22.5 2023-06-26 02:48:35,241 INFO [train.py:996] (2/4) Epoch 9, batch 1900, loss[loss=0.2086, simple_loss=0.3019, pruned_loss=0.05766, over 21402.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2972, pruned_loss=0.0681, over 4271816.40 frames. ], batch size: 211, lr: 3.37e-03, grad_scale: 8.0 2023-06-26 02:48:39,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=1475142.0, ans=0.1 2023-06-26 02:48:44,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1475142.0, ans=0.125 2023-06-26 02:49:01,223 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=12.0 2023-06-26 02:49:05,000 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.27 vs. limit=15.0 2023-06-26 02:49:28,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1475262.0, ans=0.125 2023-06-26 02:50:22,002 INFO [train.py:996] (2/4) Epoch 9, batch 1950, loss[loss=0.1916, simple_loss=0.2669, pruned_loss=0.05821, over 21688.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2924, pruned_loss=0.06647, over 4275884.58 frames. ], batch size: 333, lr: 3.37e-03, grad_scale: 8.0 2023-06-26 02:50:23,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1475442.0, ans=0.125 2023-06-26 02:50:39,719 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.119e+02 4.600e+02 6.101e+02 9.329e+02 1.931e+03, threshold=1.220e+03, percent-clipped=7.0 2023-06-26 02:51:37,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1475622.0, ans=0.125 2023-06-26 02:51:44,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1475622.0, ans=0.1 2023-06-26 02:51:44,773 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.47 vs. limit=6.0 2023-06-26 02:51:53,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1475622.0, ans=0.125 2023-06-26 02:52:13,511 INFO [train.py:996] (2/4) Epoch 9, batch 2000, loss[loss=0.1991, simple_loss=0.2757, pruned_loss=0.06129, over 21628.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2887, pruned_loss=0.06571, over 4278312.47 frames. ], batch size: 263, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:53:35,459 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.34 vs. 
limit=15.0 2023-06-26 02:53:45,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1475982.0, ans=10.0 2023-06-26 02:53:54,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1475982.0, ans=0.125 2023-06-26 02:53:58,861 INFO [train.py:996] (2/4) Epoch 9, batch 2050, loss[loss=0.2133, simple_loss=0.2806, pruned_loss=0.07297, over 21333.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2886, pruned_loss=0.06645, over 4278784.71 frames. ], batch size: 176, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:54:04,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1476042.0, ans=0.1 2023-06-26 02:54:11,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1476042.0, ans=0.125 2023-06-26 02:54:16,488 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.128e+02 5.298e+02 7.792e+02 1.006e+03 2.094e+03, threshold=1.558e+03, percent-clipped=16.0 2023-06-26 02:55:10,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1476162.0, ans=0.125 2023-06-26 02:55:53,056 INFO [train.py:996] (2/4) Epoch 9, batch 2100, loss[loss=0.2174, simple_loss=0.2896, pruned_loss=0.0726, over 21744.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2946, pruned_loss=0.06897, over 4282091.93 frames. ], batch size: 282, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:56:16,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1476402.0, ans=0.1 2023-06-26 02:57:21,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1476522.0, ans=0.1 2023-06-26 02:57:21,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1476522.0, ans=0.07 2023-06-26 02:57:24,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1476522.0, ans=0.1 2023-06-26 02:57:36,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1476582.0, ans=0.125 2023-06-26 02:57:44,908 INFO [train.py:996] (2/4) Epoch 9, batch 2150, loss[loss=0.213, simple_loss=0.2987, pruned_loss=0.06369, over 21169.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2941, pruned_loss=0.06899, over 4282283.21 frames. ], batch size: 159, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:58:02,894 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.265e+02 5.087e+02 7.506e+02 1.094e+03 2.833e+03, threshold=1.501e+03, percent-clipped=11.0 2023-06-26 02:58:54,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1476822.0, ans=0.0 2023-06-26 02:59:31,671 INFO [train.py:996] (2/4) Epoch 9, batch 2200, loss[loss=0.2038, simple_loss=0.2989, pruned_loss=0.05435, over 21861.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2949, pruned_loss=0.06921, over 4286340.94 frames. ], batch size: 371, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:01:15,215 INFO [train.py:996] (2/4) Epoch 9, batch 2250, loss[loss=0.1814, simple_loss=0.2495, pruned_loss=0.05666, over 14738.00 frames. 
], tot_loss[loss=0.2135, simple_loss=0.2925, pruned_loss=0.06727, over 4270971.06 frames. ], batch size: 61, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:01:32,993 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.139e+02 4.755e+02 7.951e+02 1.208e+03 2.238e+03, threshold=1.590e+03, percent-clipped=7.0 2023-06-26 03:01:33,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1477302.0, ans=0.1 2023-06-26 03:02:26,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1477362.0, ans=0.125 2023-06-26 03:02:50,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1477482.0, ans=0.0 2023-06-26 03:02:54,473 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-26 03:03:00,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1477482.0, ans=0.125 2023-06-26 03:03:05,229 INFO [train.py:996] (2/4) Epoch 9, batch 2300, loss[loss=0.2061, simple_loss=0.269, pruned_loss=0.0716, over 21428.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2892, pruned_loss=0.06721, over 4259119.04 frames. ], batch size: 389, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:03:49,563 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.53 vs. limit=15.0 2023-06-26 03:04:19,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1477722.0, ans=0.1 2023-06-26 03:04:21,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1477722.0, ans=0.0 2023-06-26 03:04:35,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1477782.0, ans=0.1 2023-06-26 03:04:37,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1477782.0, ans=0.0 2023-06-26 03:04:50,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1477842.0, ans=0.125 2023-06-26 03:04:51,434 INFO [train.py:996] (2/4) Epoch 9, batch 2350, loss[loss=0.1939, simple_loss=0.2631, pruned_loss=0.06233, over 21743.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2856, pruned_loss=0.06702, over 4260223.83 frames. ], batch size: 112, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:05:15,073 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.161e+02 4.711e+02 6.334e+02 1.025e+03 2.139e+03, threshold=1.267e+03, percent-clipped=9.0 2023-06-26 03:06:20,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1478022.0, ans=0.125 2023-06-26 03:06:28,301 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=22.5 2023-06-26 03:06:44,888 INFO [train.py:996] (2/4) Epoch 9, batch 2400, loss[loss=0.1994, simple_loss=0.2684, pruned_loss=0.06518, over 21690.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2893, pruned_loss=0.06901, over 4266162.47 frames. 
], batch size: 282, lr: 3.37e-03, grad_scale: 32.0 2023-06-26 03:07:28,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1478202.0, ans=0.0 2023-06-26 03:07:38,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1478262.0, ans=0.125 2023-06-26 03:08:06,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1478322.0, ans=0.0 2023-06-26 03:08:09,048 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.96 vs. limit=15.0 2023-06-26 03:08:26,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1478382.0, ans=0.125 2023-06-26 03:08:36,734 INFO [train.py:996] (2/4) Epoch 9, batch 2450, loss[loss=0.1888, simple_loss=0.2519, pruned_loss=0.06287, over 21466.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2935, pruned_loss=0.07158, over 4274967.28 frames. ], batch size: 212, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:09:01,671 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.534e+02 5.033e+02 6.854e+02 1.116e+03 2.187e+03, threshold=1.371e+03, percent-clipped=18.0 2023-06-26 03:09:23,425 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-26 03:09:28,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1478562.0, ans=0.125 2023-06-26 03:09:28,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1478562.0, ans=0.0 2023-06-26 03:10:21,191 INFO [train.py:996] (2/4) Epoch 9, batch 2500, loss[loss=0.2018, simple_loss=0.3037, pruned_loss=0.04995, over 21683.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2936, pruned_loss=0.0714, over 4263268.53 frames. ], batch size: 298, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:10:31,762 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.04 vs. limit=10.0 2023-06-26 03:10:34,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1478742.0, ans=0.0 2023-06-26 03:11:40,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1478922.0, ans=0.5 2023-06-26 03:11:55,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1478982.0, ans=0.1 2023-06-26 03:12:06,828 INFO [train.py:996] (2/4) Epoch 9, batch 2550, loss[loss=0.2026, simple_loss=0.2951, pruned_loss=0.05511, over 21605.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2922, pruned_loss=0.06957, over 4255878.21 frames. 
], batch size: 230, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:12:25,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1479042.0, ans=0.125 2023-06-26 03:12:37,800 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.177e+02 4.403e+02 6.951e+02 9.882e+02 2.721e+03, threshold=1.390e+03, percent-clipped=12.0 2023-06-26 03:13:13,442 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=15.0 2023-06-26 03:13:18,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1479162.0, ans=0.1 2023-06-26 03:13:31,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1479222.0, ans=0.125 2023-06-26 03:13:39,487 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.80 vs. limit=22.5 2023-06-26 03:13:57,246 INFO [train.py:996] (2/4) Epoch 9, batch 2600, loss[loss=0.2396, simple_loss=0.3086, pruned_loss=0.0853, over 21575.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2949, pruned_loss=0.07076, over 4260349.54 frames. ], batch size: 415, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:14:09,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1479342.0, ans=0.025 2023-06-26 03:14:16,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1479342.0, ans=0.0 2023-06-26 03:15:01,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=1479462.0, ans=0.02 2023-06-26 03:15:28,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1479582.0, ans=0.125 2023-06-26 03:15:31,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1479582.0, ans=0.2 2023-06-26 03:15:43,489 INFO [train.py:996] (2/4) Epoch 9, batch 2650, loss[loss=0.2048, simple_loss=0.298, pruned_loss=0.05583, over 21776.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2951, pruned_loss=0.07199, over 4270634.58 frames. ], batch size: 332, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:16:14,201 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.480e+02 5.388e+02 7.988e+02 1.143e+03 2.285e+03, threshold=1.598e+03, percent-clipped=12.0 2023-06-26 03:16:38,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1479762.0, ans=0.125 2023-06-26 03:16:46,600 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=22.5 2023-06-26 03:17:08,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1479822.0, ans=0.2 2023-06-26 03:17:22,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1479882.0, ans=0.2 2023-06-26 03:17:29,017 INFO [train.py:996] (2/4) Epoch 9, batch 2700, loss[loss=0.1646, simple_loss=0.2276, pruned_loss=0.05075, over 21345.00 frames. 
], tot_loss[loss=0.2167, simple_loss=0.2922, pruned_loss=0.07059, over 4269481.99 frames. ], batch size: 159, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:18:01,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1480002.0, ans=0.0 2023-06-26 03:18:02,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1480002.0, ans=0.125 2023-06-26 03:18:34,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1480062.0, ans=0.125 2023-06-26 03:19:10,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1480182.0, ans=0.1 2023-06-26 03:19:16,105 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=12.0 2023-06-26 03:19:20,002 INFO [train.py:996] (2/4) Epoch 9, batch 2750, loss[loss=0.2222, simple_loss=0.2988, pruned_loss=0.0728, over 21249.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2917, pruned_loss=0.07048, over 4270618.51 frames. ], batch size: 143, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:19:31,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1480242.0, ans=0.125 2023-06-26 03:19:51,123 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.311e+02 4.494e+02 5.812e+02 9.696e+02 2.134e+03, threshold=1.162e+03, percent-clipped=3.0 2023-06-26 03:20:57,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1480482.0, ans=0.125 2023-06-26 03:21:19,599 INFO [train.py:996] (2/4) Epoch 9, batch 2800, loss[loss=0.2598, simple_loss=0.3323, pruned_loss=0.09368, over 21766.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2972, pruned_loss=0.07276, over 4273639.17 frames. ], batch size: 282, lr: 3.37e-03, grad_scale: 32.0 2023-06-26 03:21:36,025 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.41 vs. limit=15.0 2023-06-26 03:21:38,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1480542.0, ans=0.125 2023-06-26 03:23:10,837 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:23:14,725 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.67 vs. limit=5.0 2023-06-26 03:23:17,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1480842.0, ans=0.025 2023-06-26 03:23:18,686 INFO [train.py:996] (2/4) Epoch 9, batch 2850, loss[loss=0.1523, simple_loss=0.2075, pruned_loss=0.0485, over 21251.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2969, pruned_loss=0.07322, over 4267125.50 frames. 
], batch size: 143, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:23:45,694 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.704e+02 5.417e+02 7.792e+02 1.299e+03 2.553e+03, threshold=1.558e+03, percent-clipped=28.0 2023-06-26 03:24:25,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1481022.0, ans=0.09899494936611666 2023-06-26 03:24:45,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=1481082.0, ans=0.1 2023-06-26 03:25:03,448 INFO [train.py:996] (2/4) Epoch 9, batch 2900, loss[loss=0.2538, simple_loss=0.3107, pruned_loss=0.0984, over 21736.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2966, pruned_loss=0.07303, over 4268789.04 frames. ], batch size: 473, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:25:54,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1481262.0, ans=0.0 2023-06-26 03:25:57,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1481262.0, ans=0.0 2023-06-26 03:26:02,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1481262.0, ans=0.0 2023-06-26 03:26:18,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1481322.0, ans=0.125 2023-06-26 03:26:22,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1481322.0, ans=0.125 2023-06-26 03:26:22,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1481322.0, ans=0.125 2023-06-26 03:26:24,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1481382.0, ans=0.125 2023-06-26 03:26:53,594 INFO [train.py:996] (2/4) Epoch 9, batch 2950, loss[loss=0.2088, simple_loss=0.2926, pruned_loss=0.0625, over 21418.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2987, pruned_loss=0.07271, over 4274925.92 frames. ], batch size: 144, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:26:59,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1481442.0, ans=0.125 2023-06-26 03:27:21,364 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.283e+02 4.507e+02 5.801e+02 9.754e+02 1.778e+03, threshold=1.160e+03, percent-clipped=2.0 2023-06-26 03:27:22,699 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=22.5 2023-06-26 03:27:54,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1481622.0, ans=0.1 2023-06-26 03:28:18,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1481682.0, ans=0.05 2023-06-26 03:28:38,619 INFO [train.py:996] (2/4) Epoch 9, batch 3000, loss[loss=0.2434, simple_loss=0.3282, pruned_loss=0.07932, over 21444.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3028, pruned_loss=0.07384, over 4282540.77 frames. 
], batch size: 131, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:28:38,619 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-26 03:28:55,830 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.2746, 3.0336, 3.1788, 3.3907, 2.7939, 2.7876, 3.4457, 3.3953], device='cuda:2') 2023-06-26 03:29:01,207 INFO [train.py:1028] (2/4) Epoch 9, validation: loss=0.2514, simple_loss=0.3427, pruned_loss=0.08003, over 1796401.00 frames. 2023-06-26 03:29:01,208 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-26 03:29:03,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1481742.0, ans=0.125 2023-06-26 03:30:20,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1481922.0, ans=0.025 2023-06-26 03:30:43,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1481982.0, ans=0.125 2023-06-26 03:30:48,445 INFO [train.py:996] (2/4) Epoch 9, batch 3050, loss[loss=0.1996, simple_loss=0.2689, pruned_loss=0.0652, over 21782.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.3014, pruned_loss=0.07151, over 4278303.90 frames. ], batch size: 112, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:31:01,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=1482042.0, ans=0.2 2023-06-26 03:31:09,640 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.261e+02 4.684e+02 7.478e+02 1.068e+03 1.857e+03, threshold=1.496e+03, percent-clipped=20.0 2023-06-26 03:31:16,659 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.91 vs. limit=15.0 2023-06-26 03:32:04,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1482222.0, ans=0.125 2023-06-26 03:32:05,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1482222.0, ans=10.0 2023-06-26 03:32:42,247 INFO [train.py:996] (2/4) Epoch 9, batch 3100, loss[loss=0.2355, simple_loss=0.3252, pruned_loss=0.0729, over 21602.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.301, pruned_loss=0.07069, over 4270759.10 frames. ], batch size: 441, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:33:03,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1482402.0, ans=0.125 2023-06-26 03:33:03,972 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=22.5 2023-06-26 03:34:36,177 INFO [train.py:996] (2/4) Epoch 9, batch 3150, loss[loss=0.2543, simple_loss=0.3267, pruned_loss=0.09092, over 21327.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3033, pruned_loss=0.07118, over 4264489.15 frames. 
], batch size: 143, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:34:40,521 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:34:51,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1482642.0, ans=0.1 2023-06-26 03:34:58,319 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.871e+02 4.385e+02 6.208e+02 9.255e+02 2.149e+03, threshold=1.242e+03, percent-clipped=3.0 2023-06-26 03:35:08,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1482702.0, ans=0.1 2023-06-26 03:35:32,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1482762.0, ans=0.125 2023-06-26 03:36:04,224 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.23 vs. limit=22.5 2023-06-26 03:36:28,386 INFO [train.py:996] (2/4) Epoch 9, batch 3200, loss[loss=0.1991, simple_loss=0.2747, pruned_loss=0.06173, over 21219.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.303, pruned_loss=0.07126, over 4261690.61 frames. ], batch size: 159, lr: 3.36e-03, grad_scale: 32.0 2023-06-26 03:36:36,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1482942.0, ans=0.0 2023-06-26 03:37:39,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1483122.0, ans=0.125 2023-06-26 03:37:40,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1483122.0, ans=0.1 2023-06-26 03:37:58,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1483182.0, ans=0.125 2023-06-26 03:38:13,678 INFO [train.py:996] (2/4) Epoch 9, batch 3250, loss[loss=0.2308, simple_loss=0.2957, pruned_loss=0.08297, over 21877.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3032, pruned_loss=0.07317, over 4267355.19 frames. ], batch size: 372, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:38:17,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1483242.0, ans=0.2 2023-06-26 03:38:19,563 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:38:47,166 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.805e+02 4.877e+02 6.649e+02 1.271e+03 2.472e+03, threshold=1.330e+03, percent-clipped=27.0 2023-06-26 03:39:33,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1483422.0, ans=0.0 2023-06-26 03:39:35,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1483422.0, ans=0.1 2023-06-26 03:39:41,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1483422.0, ans=0.125 2023-06-26 03:40:05,655 INFO [train.py:996] (2/4) Epoch 9, batch 3300, loss[loss=0.2283, simple_loss=0.316, pruned_loss=0.07026, over 21561.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2986, pruned_loss=0.0729, over 4267927.50 frames. 
], batch size: 414, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:41:24,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1483722.0, ans=0.0 2023-06-26 03:41:39,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1483782.0, ans=0.125 2023-06-26 03:41:51,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1483782.0, ans=0.125 2023-06-26 03:42:03,502 INFO [train.py:996] (2/4) Epoch 9, batch 3350, loss[loss=0.2385, simple_loss=0.3115, pruned_loss=0.08276, over 21356.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.301, pruned_loss=0.07326, over 4262170.73 frames. ], batch size: 176, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:42:27,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1483902.0, ans=0.04949747468305833 2023-06-26 03:42:36,952 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.432e+02 5.027e+02 7.904e+02 1.051e+03 2.659e+03, threshold=1.581e+03, percent-clipped=15.0 2023-06-26 03:43:28,009 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.71 vs. limit=6.0 2023-06-26 03:43:32,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1484082.0, ans=0.125 2023-06-26 03:43:37,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1484082.0, ans=0.125 2023-06-26 03:43:58,363 INFO [train.py:996] (2/4) Epoch 9, batch 3400, loss[loss=0.2665, simple_loss=0.3389, pruned_loss=0.09709, over 21482.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3005, pruned_loss=0.07373, over 4277006.28 frames. ], batch size: 507, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:45:08,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1484322.0, ans=0.1 2023-06-26 03:45:10,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=1484322.0, ans=0.02 2023-06-26 03:45:34,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1484382.0, ans=0.125 2023-06-26 03:45:51,282 INFO [train.py:996] (2/4) Epoch 9, batch 3450, loss[loss=0.2408, simple_loss=0.314, pruned_loss=0.08376, over 21862.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2946, pruned_loss=0.07261, over 4277446.54 frames. 
], batch size: 372, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:46:19,613 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.178e+02 5.080e+02 7.210e+02 9.972e+02 1.993e+03, threshold=1.442e+03, percent-clipped=4.0 2023-06-26 03:46:20,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1484502.0, ans=0.125 2023-06-26 03:46:42,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1484562.0, ans=0.125 2023-06-26 03:47:03,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1484622.0, ans=10.0 2023-06-26 03:47:06,355 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.31 vs. limit=10.0 2023-06-26 03:47:47,819 INFO [train.py:996] (2/4) Epoch 9, batch 3500, loss[loss=0.2988, simple_loss=0.3704, pruned_loss=0.1136, over 21728.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3032, pruned_loss=0.07635, over 4281164.24 frames. ], batch size: 441, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:48:12,892 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.11 vs. limit=6.0 2023-06-26 03:49:37,620 INFO [train.py:996] (2/4) Epoch 9, batch 3550, loss[loss=0.203, simple_loss=0.2858, pruned_loss=0.06009, over 21008.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3061, pruned_loss=0.07755, over 4274210.62 frames. ], batch size: 608, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:50:05,953 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.388e+02 4.836e+02 6.336e+02 9.493e+02 2.947e+03, threshold=1.267e+03, percent-clipped=8.0 2023-06-26 03:50:30,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1485162.0, ans=0.125 2023-06-26 03:51:27,690 INFO [train.py:996] (2/4) Epoch 9, batch 3600, loss[loss=0.2257, simple_loss=0.2966, pruned_loss=0.0774, over 21253.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3008, pruned_loss=0.07602, over 4268368.53 frames. ], batch size: 143, lr: 3.36e-03, grad_scale: 32.0 2023-06-26 03:51:28,651 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:52:09,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1485462.0, ans=0.1 2023-06-26 03:52:17,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1485462.0, ans=0.0 2023-06-26 03:52:45,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1485522.0, ans=0.1 2023-06-26 03:53:07,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1485582.0, ans=0.125 2023-06-26 03:53:28,623 INFO [train.py:996] (2/4) Epoch 9, batch 3650, loss[loss=0.2683, simple_loss=0.3354, pruned_loss=0.1006, over 21420.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3013, pruned_loss=0.07576, over 4267851.54 frames. 
], batch size: 471, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:53:35,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1485642.0, ans=0.05 2023-06-26 03:53:35,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1485642.0, ans=0.125 2023-06-26 03:53:41,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1485642.0, ans=0.2 2023-06-26 03:53:53,234 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.365e+02 4.857e+02 6.488e+02 1.037e+03 3.171e+03, threshold=1.298e+03, percent-clipped=18.0 2023-06-26 03:54:01,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1485702.0, ans=0.09899494936611666 2023-06-26 03:54:34,508 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=15.0 2023-06-26 03:54:39,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1485822.0, ans=0.1 2023-06-26 03:55:19,288 INFO [train.py:996] (2/4) Epoch 9, batch 3700, loss[loss=0.2186, simple_loss=0.2964, pruned_loss=0.07044, over 21556.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3003, pruned_loss=0.07426, over 4268390.10 frames. ], batch size: 131, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:55:23,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1485942.0, ans=0.1 2023-06-26 03:55:57,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1486062.0, ans=0.125 2023-06-26 03:56:18,221 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.21 vs. limit=12.0 2023-06-26 03:56:36,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1486122.0, ans=0.0 2023-06-26 03:57:10,170 INFO [train.py:996] (2/4) Epoch 9, batch 3750, loss[loss=0.2571, simple_loss=0.3577, pruned_loss=0.07827, over 19874.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3011, pruned_loss=0.07458, over 4269623.13 frames. 
], batch size: 702, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:57:12,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1486242.0, ans=0.0 2023-06-26 03:57:35,336 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.387e+02 4.638e+02 6.369e+02 1.007e+03 1.951e+03, threshold=1.274e+03, percent-clipped=10.0 2023-06-26 03:57:49,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1486362.0, ans=0.125 2023-06-26 03:57:49,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1486362.0, ans=0.0 2023-06-26 03:58:40,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1486422.0, ans=0.125 2023-06-26 03:58:41,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1486482.0, ans=0.125 2023-06-26 03:58:51,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1486482.0, ans=0.2 2023-06-26 03:59:00,825 INFO [train.py:996] (2/4) Epoch 9, batch 3800, loss[loss=0.2193, simple_loss=0.2994, pruned_loss=0.06958, over 21738.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2979, pruned_loss=0.07241, over 4272725.68 frames. ], batch size: 351, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:59:03,701 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-26 03:59:21,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1486602.0, ans=0.0 2023-06-26 03:59:33,943 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-26 04:00:34,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1486782.0, ans=0.0 2023-06-26 04:00:46,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1486782.0, ans=0.1 2023-06-26 04:00:49,610 INFO [train.py:996] (2/4) Epoch 9, batch 3850, loss[loss=0.1914, simple_loss=0.2531, pruned_loss=0.06489, over 21156.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2954, pruned_loss=0.07301, over 4272758.65 frames. 
], batch size: 176, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:01:08,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1486902.0, ans=0.125 2023-06-26 04:01:19,295 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.209e+02 4.310e+02 5.472e+02 7.871e+02 1.774e+03, threshold=1.094e+03, percent-clipped=3.0 2023-06-26 04:01:26,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1486902.0, ans=0.125 2023-06-26 04:01:40,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.whiten.whitening_limit, batch_count=1486962.0, ans=15.0 2023-06-26 04:02:22,516 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=15.0 2023-06-26 04:02:39,300 INFO [train.py:996] (2/4) Epoch 9, batch 3900, loss[loss=0.2316, simple_loss=0.2885, pruned_loss=0.08733, over 21666.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.291, pruned_loss=0.07268, over 4271377.57 frames. ], batch size: 414, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:02:56,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1487202.0, ans=0.0 2023-06-26 04:04:29,558 INFO [train.py:996] (2/4) Epoch 9, batch 3950, loss[loss=0.1898, simple_loss=0.2787, pruned_loss=0.05048, over 21800.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2943, pruned_loss=0.07259, over 4279883.00 frames. ], batch size: 371, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:04:32,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1487442.0, ans=0.09899494936611666 2023-06-26 04:04:59,611 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.430e+02 5.269e+02 7.379e+02 1.187e+03 2.051e+03, threshold=1.476e+03, percent-clipped=29.0 2023-06-26 04:05:34,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1487562.0, ans=0.125 2023-06-26 04:06:21,602 INFO [train.py:996] (2/4) Epoch 9, batch 4000, loss[loss=0.2248, simple_loss=0.3149, pruned_loss=0.06732, over 20744.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2877, pruned_loss=0.06944, over 4267152.09 frames. ], batch size: 608, lr: 3.36e-03, grad_scale: 32.0 2023-06-26 04:07:06,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1487802.0, ans=0.125 2023-06-26 04:07:49,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1487922.0, ans=0.125 2023-06-26 04:07:51,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1487922.0, ans=0.125 2023-06-26 04:07:56,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1487982.0, ans=0.0 2023-06-26 04:08:15,068 INFO [train.py:996] (2/4) Epoch 9, batch 4050, loss[loss=0.2023, simple_loss=0.2795, pruned_loss=0.06255, over 21358.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2866, pruned_loss=0.06762, over 4271622.38 frames. 
], batch size: 176, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:08:17,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1488042.0, ans=0.0 2023-06-26 04:08:39,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1488102.0, ans=0.0 2023-06-26 04:08:54,029 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.060e+02 4.408e+02 5.792e+02 1.027e+03 1.957e+03, threshold=1.158e+03, percent-clipped=6.0 2023-06-26 04:09:22,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1488162.0, ans=0.0 2023-06-26 04:09:47,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1488282.0, ans=0.125 2023-06-26 04:09:47,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1488282.0, ans=0.125 2023-06-26 04:10:06,513 INFO [train.py:996] (2/4) Epoch 9, batch 4100, loss[loss=0.2438, simple_loss=0.3177, pruned_loss=0.085, over 21939.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2887, pruned_loss=0.06771, over 4280101.44 frames. ], batch size: 107, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:10:07,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1488342.0, ans=0.125 2023-06-26 04:10:15,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1488342.0, ans=0.125 2023-06-26 04:11:18,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1488462.0, ans=0.1 2023-06-26 04:11:29,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1488522.0, ans=0.1 2023-06-26 04:11:38,789 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 04:11:58,943 INFO [train.py:996] (2/4) Epoch 9, batch 4150, loss[loss=0.1825, simple_loss=0.2853, pruned_loss=0.03988, over 21631.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2886, pruned_loss=0.06514, over 4267349.53 frames. ], batch size: 247, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:12:31,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1488702.0, ans=0.125 2023-06-26 04:12:42,729 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.917e+02 4.750e+02 6.636e+02 9.716e+02 1.939e+03, threshold=1.327e+03, percent-clipped=13.0 2023-06-26 04:12:46,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1488702.0, ans=0.0 2023-06-26 04:12:54,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1488762.0, ans=0.125 2023-06-26 04:13:20,632 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.68 vs. 
limit=10.0 2023-06-26 04:13:54,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1488882.0, ans=0.2 2023-06-26 04:13:57,475 INFO [train.py:996] (2/4) Epoch 9, batch 4200, loss[loss=0.2029, simple_loss=0.2753, pruned_loss=0.0653, over 21481.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2892, pruned_loss=0.06526, over 4255787.73 frames. ], batch size: 212, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:14:13,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1488942.0, ans=0.0 2023-06-26 04:14:28,653 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-26 04:14:31,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1489002.0, ans=0.125 2023-06-26 04:14:31,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1489002.0, ans=0.2 2023-06-26 04:14:37,242 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.92 vs. limit=22.5 2023-06-26 04:15:16,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1489122.0, ans=0.0 2023-06-26 04:15:28,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1489122.0, ans=0.125 2023-06-26 04:15:50,623 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.48 vs. limit=22.5 2023-06-26 04:15:51,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1489182.0, ans=0.1 2023-06-26 04:15:56,536 INFO [train.py:996] (2/4) Epoch 9, batch 4250, loss[loss=0.2177, simple_loss=0.2968, pruned_loss=0.06928, over 21415.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2943, pruned_loss=0.06694, over 4254972.01 frames. ], batch size: 194, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:16:34,293 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. limit=6.0 2023-06-26 04:16:34,529 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.173e+02 6.968e+02 9.905e+02 1.425e+03 3.258e+03, threshold=1.981e+03, percent-clipped=30.0 2023-06-26 04:16:48,066 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 04:17:06,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1489422.0, ans=0.125 2023-06-26 04:17:17,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1489422.0, ans=0.1 2023-06-26 04:17:34,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1489482.0, ans=0.0 2023-06-26 04:17:55,692 INFO [train.py:996] (2/4) Epoch 9, batch 4300, loss[loss=0.2021, simple_loss=0.2677, pruned_loss=0.06828, over 21195.00 frames. 
], tot_loss[loss=0.2213, simple_loss=0.3028, pruned_loss=0.06984, over 4259915.58 frames. ], batch size: 608, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:18:33,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1489602.0, ans=0.2 2023-06-26 04:19:19,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1489722.0, ans=0.0 2023-06-26 04:19:19,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1489722.0, ans=0.0 2023-06-26 04:19:51,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1489842.0, ans=0.125 2023-06-26 04:19:52,224 INFO [train.py:996] (2/4) Epoch 9, batch 4350, loss[loss=0.1832, simple_loss=0.2611, pruned_loss=0.05261, over 21376.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3025, pruned_loss=0.06944, over 4261255.14 frames. ], batch size: 131, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:20:18,897 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.396e+02 4.613e+02 6.929e+02 1.161e+03 2.829e+03, threshold=1.386e+03, percent-clipped=7.0 2023-06-26 04:20:26,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1489962.0, ans=0.0 2023-06-26 04:21:38,314 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=15.0 2023-06-26 04:21:42,257 INFO [train.py:996] (2/4) Epoch 9, batch 4400, loss[loss=0.2001, simple_loss=0.289, pruned_loss=0.05567, over 19963.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.297, pruned_loss=0.06858, over 4255226.62 frames. ], batch size: 702, lr: 3.36e-03, grad_scale: 32.0 2023-06-26 04:23:35,721 INFO [train.py:996] (2/4) Epoch 9, batch 4450, loss[loss=0.2363, simple_loss=0.3163, pruned_loss=0.07821, over 21280.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3043, pruned_loss=0.07045, over 4258513.24 frames. ], batch size: 159, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:24:03,478 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.467e+02 5.132e+02 7.510e+02 1.153e+03 2.650e+03, threshold=1.502e+03, percent-clipped=12.0 2023-06-26 04:24:06,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1490502.0, ans=0.125 2023-06-26 04:24:13,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1490502.0, ans=0.125 2023-06-26 04:25:25,702 INFO [train.py:996] (2/4) Epoch 9, batch 4500, loss[loss=0.2188, simple_loss=0.2957, pruned_loss=0.07092, over 21870.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.307, pruned_loss=0.07259, over 4268033.74 frames. 
], batch size: 124, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:25:40,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1490742.0, ans=0.125 2023-06-26 04:25:56,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1490802.0, ans=0.125 2023-06-26 04:26:50,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1490922.0, ans=0.125 2023-06-26 04:26:50,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1490922.0, ans=0.125 2023-06-26 04:27:15,814 INFO [train.py:996] (2/4) Epoch 9, batch 4550, loss[loss=0.2804, simple_loss=0.3481, pruned_loss=0.1063, over 21207.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3084, pruned_loss=0.07271, over 4270597.31 frames. ], batch size: 143, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:27:46,467 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.98 vs. limit=12.0 2023-06-26 04:27:47,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1491102.0, ans=0.2 2023-06-26 04:27:56,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1491102.0, ans=0.125 2023-06-26 04:28:00,504 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.475e+02 4.870e+02 6.557e+02 1.171e+03 3.635e+03, threshold=1.311e+03, percent-clipped=15.0 2023-06-26 04:28:08,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1491162.0, ans=0.2 2023-06-26 04:28:53,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1491282.0, ans=0.125 2023-06-26 04:28:55,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1491282.0, ans=0.1 2023-06-26 04:28:55,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1491282.0, ans=0.0 2023-06-26 04:29:05,595 INFO [train.py:996] (2/4) Epoch 9, batch 4600, loss[loss=0.2201, simple_loss=0.2994, pruned_loss=0.0704, over 21708.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3089, pruned_loss=0.07383, over 4273603.91 frames. ], batch size: 441, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:29:33,435 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.93 vs. limit=22.5 2023-06-26 04:30:16,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1491462.0, ans=0.1 2023-06-26 04:30:48,598 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.86 vs. 
limit=15.0 2023-06-26 04:30:57,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1491582.0, ans=0.05 2023-06-26 04:30:59,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1491642.0, ans=0.0 2023-06-26 04:30:59,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1491642.0, ans=0.0 2023-06-26 04:31:00,436 INFO [train.py:996] (2/4) Epoch 9, batch 4650, loss[loss=0.1888, simple_loss=0.2556, pruned_loss=0.06103, over 20158.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3032, pruned_loss=0.07205, over 4276543.55 frames. ], batch size: 703, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:31:29,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1491702.0, ans=0.1 2023-06-26 04:31:37,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1491702.0, ans=0.0 2023-06-26 04:31:38,990 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.118e+02 4.318e+02 5.535e+02 7.322e+02 1.899e+03, threshold=1.107e+03, percent-clipped=2.0 2023-06-26 04:31:40,200 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=15.0 2023-06-26 04:32:33,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1491882.0, ans=0.125 2023-06-26 04:32:55,332 INFO [train.py:996] (2/4) Epoch 9, batch 4700, loss[loss=0.1925, simple_loss=0.2596, pruned_loss=0.06267, over 21784.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2938, pruned_loss=0.07, over 4278661.93 frames. ], batch size: 317, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:34:38,024 INFO [train.py:996] (2/4) Epoch 9, batch 4750, loss[loss=0.1872, simple_loss=0.2535, pruned_loss=0.06047, over 21594.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2883, pruned_loss=0.07027, over 4274360.94 frames. ], batch size: 298, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:34:52,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1492242.0, ans=0.0 2023-06-26 04:35:16,911 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.998e+02 4.450e+02 6.729e+02 1.004e+03 1.717e+03, threshold=1.346e+03, percent-clipped=12.0 2023-06-26 04:35:22,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1492362.0, ans=0.1 2023-06-26 04:35:33,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1492362.0, ans=0.125 2023-06-26 04:35:46,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1492422.0, ans=0.125 2023-06-26 04:36:00,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1492422.0, ans=0.0 2023-06-26 04:36:30,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1492482.0, ans=0.0 2023-06-26 04:36:32,956 INFO [train.py:996] (2/4) Epoch 9, batch 4800, loss[loss=0.2013, simple_loss=0.2995, pruned_loss=0.0515, over 21782.00 frames. 
], tot_loss[loss=0.2152, simple_loss=0.2885, pruned_loss=0.07089, over 4281165.62 frames. ], batch size: 351, lr: 3.35e-03, grad_scale: 32.0 2023-06-26 04:37:34,941 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-26 04:38:21,211 INFO [train.py:996] (2/4) Epoch 9, batch 4850, loss[loss=0.2146, simple_loss=0.2954, pruned_loss=0.06692, over 21855.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2879, pruned_loss=0.07026, over 4280343.53 frames. ], batch size: 124, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:38:27,656 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.81 vs. limit=15.0 2023-06-26 04:38:47,023 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.67 vs. limit=15.0 2023-06-26 04:38:56,364 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.280e+02 4.162e+02 5.020e+02 8.423e+02 2.243e+03, threshold=1.004e+03, percent-clipped=7.0 2023-06-26 04:39:28,114 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-26 04:39:58,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1493082.0, ans=0.125 2023-06-26 04:40:11,782 INFO [train.py:996] (2/4) Epoch 9, batch 4900, loss[loss=0.1869, simple_loss=0.2346, pruned_loss=0.06958, over 20207.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2901, pruned_loss=0.07092, over 4276536.27 frames. ], batch size: 703, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:40:17,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1493142.0, ans=0.125 2023-06-26 04:41:52,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1493382.0, ans=0.0 2023-06-26 04:41:57,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1493382.0, ans=0.125 2023-06-26 04:42:01,694 INFO [train.py:996] (2/4) Epoch 9, batch 4950, loss[loss=0.1939, simple_loss=0.2846, pruned_loss=0.05161, over 21788.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.295, pruned_loss=0.069, over 4283720.00 frames. ], batch size: 282, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:42:42,358 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.982e+02 5.004e+02 7.690e+02 1.209e+03 2.410e+03, threshold=1.538e+03, percent-clipped=31.0 2023-06-26 04:42:56,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1493562.0, ans=0.0 2023-06-26 04:43:40,772 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=12.0 2023-06-26 04:43:44,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1493682.0, ans=0.2 2023-06-26 04:43:49,279 INFO [train.py:996] (2/4) Epoch 9, batch 5000, loss[loss=0.2449, simple_loss=0.3064, pruned_loss=0.09169, over 21314.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2932, pruned_loss=0.06583, over 4284588.28 frames. 
], batch size: 143, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:44:00,757 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.55 vs. limit=15.0 2023-06-26 04:45:37,519 INFO [train.py:996] (2/4) Epoch 9, batch 5050, loss[loss=0.2227, simple_loss=0.2971, pruned_loss=0.07413, over 21847.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2934, pruned_loss=0.06753, over 4286748.29 frames. ], batch size: 118, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:46:12,688 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.467e+02 4.718e+02 6.361e+02 8.600e+02 1.640e+03, threshold=1.272e+03, percent-clipped=2.0 2023-06-26 04:46:45,714 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 04:46:54,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1494222.0, ans=0.0 2023-06-26 04:47:10,335 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=22.5 2023-06-26 04:47:26,072 INFO [train.py:996] (2/4) Epoch 9, batch 5100, loss[loss=0.1898, simple_loss=0.2698, pruned_loss=0.05494, over 21822.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2919, pruned_loss=0.06811, over 4293045.67 frames. ], batch size: 332, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:48:33,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1494522.0, ans=0.2 2023-06-26 04:49:09,879 INFO [train.py:996] (2/4) Epoch 9, batch 5150, loss[loss=0.252, simple_loss=0.322, pruned_loss=0.09102, over 21732.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2921, pruned_loss=0.06814, over 4287513.29 frames. ], batch size: 441, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:49:19,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1494642.0, ans=0.1 2023-06-26 04:49:40,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1494702.0, ans=0.125 2023-06-26 04:49:50,342 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.908e+02 4.572e+02 6.344e+02 1.136e+03 2.635e+03, threshold=1.269e+03, percent-clipped=18.0 2023-06-26 04:50:05,214 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=15.0 2023-06-26 04:50:06,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1494762.0, ans=0.125 2023-06-26 04:51:10,756 INFO [train.py:996] (2/4) Epoch 9, batch 5200, loss[loss=0.2097, simple_loss=0.2936, pruned_loss=0.06287, over 21275.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2967, pruned_loss=0.0691, over 4275662.06 frames. 
], batch size: 159, lr: 3.35e-03, grad_scale: 32.0 2023-06-26 04:51:43,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1495002.0, ans=0.0 2023-06-26 04:52:16,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1495122.0, ans=10.0 2023-06-26 04:52:58,605 INFO [train.py:996] (2/4) Epoch 9, batch 5250, loss[loss=0.2261, simple_loss=0.3111, pruned_loss=0.0705, over 21831.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.302, pruned_loss=0.06816, over 4268707.53 frames. ], batch size: 316, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:53:19,195 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-26 04:53:20,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1495302.0, ans=0.1 2023-06-26 04:53:20,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1495302.0, ans=0.125 2023-06-26 04:53:34,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1495302.0, ans=0.0 2023-06-26 04:53:35,849 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.142e+02 4.723e+02 6.704e+02 8.682e+02 1.617e+03, threshold=1.341e+03, percent-clipped=7.0 2023-06-26 04:54:08,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1495422.0, ans=15.0 2023-06-26 04:54:16,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1495422.0, ans=0.1 2023-06-26 04:54:32,589 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.34 vs. limit=15.0 2023-06-26 04:54:50,688 INFO [train.py:996] (2/4) Epoch 9, batch 5300, loss[loss=0.2184, simple_loss=0.3318, pruned_loss=0.05252, over 19806.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.3008, pruned_loss=0.06834, over 4267279.35 frames. ], batch size: 702, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:55:11,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1495542.0, ans=0.2 2023-06-26 04:55:33,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1495662.0, ans=0.0 2023-06-26 04:55:53,484 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=22.5 2023-06-26 04:55:57,583 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.49 vs. 
limit=22.5 2023-06-26 04:56:26,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1495782.0, ans=0.125 2023-06-26 04:56:36,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1495782.0, ans=0.1 2023-06-26 04:56:38,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1495842.0, ans=0.125 2023-06-26 04:56:39,207 INFO [train.py:996] (2/4) Epoch 9, batch 5350, loss[loss=0.2056, simple_loss=0.2734, pruned_loss=0.06885, over 21954.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2984, pruned_loss=0.06948, over 4282568.00 frames. ], batch size: 316, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:56:43,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1495842.0, ans=0.125 2023-06-26 04:57:15,453 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.489e+02 4.386e+02 5.571e+02 7.652e+02 1.743e+03, threshold=1.114e+03, percent-clipped=3.0 2023-06-26 04:57:24,182 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-26 04:57:25,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1495962.0, ans=0.2 2023-06-26 04:57:56,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1496022.0, ans=0.125 2023-06-26 04:58:19,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1496082.0, ans=0.1 2023-06-26 04:58:27,552 INFO [train.py:996] (2/4) Epoch 9, batch 5400, loss[loss=0.2286, simple_loss=0.2982, pruned_loss=0.07953, over 21885.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2977, pruned_loss=0.07012, over 4283009.37 frames. ], batch size: 414, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:59:44,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1496322.0, ans=0.1 2023-06-26 04:59:46,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1496322.0, ans=0.2 2023-06-26 05:00:22,765 INFO [train.py:996] (2/4) Epoch 9, batch 5450, loss[loss=0.2189, simple_loss=0.3157, pruned_loss=0.06106, over 21742.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2967, pruned_loss=0.06829, over 4286764.46 frames. 
], batch size: 247, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:00:41,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1496502.0, ans=0.125 2023-06-26 05:00:54,980 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.911e+02 4.664e+02 7.291e+02 1.143e+03 2.963e+03, threshold=1.458e+03, percent-clipped=26.0 2023-06-26 05:01:00,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1496502.0, ans=0.0 2023-06-26 05:01:26,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1496622.0, ans=0.125 2023-06-26 05:01:30,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1496622.0, ans=0.125 2023-06-26 05:01:40,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1496622.0, ans=0.1 2023-06-26 05:02:12,203 INFO [train.py:996] (2/4) Epoch 9, batch 5500, loss[loss=0.175, simple_loss=0.262, pruned_loss=0.04403, over 21088.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.3003, pruned_loss=0.06527, over 4274359.44 frames. ], batch size: 143, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:02:40,363 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:03:05,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1496862.0, ans=0.0 2023-06-26 05:03:17,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1496862.0, ans=0.125 2023-06-26 05:04:02,056 INFO [train.py:996] (2/4) Epoch 9, batch 5550, loss[loss=0.2037, simple_loss=0.3022, pruned_loss=0.05255, over 21803.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.3001, pruned_loss=0.06328, over 4276704.95 frames. ], batch size: 282, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:04:19,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1497042.0, ans=0.125 2023-06-26 05:04:44,536 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.199e+02 5.816e+02 9.061e+02 1.223e+03 2.185e+03, threshold=1.812e+03, percent-clipped=16.0 2023-06-26 05:05:06,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1497162.0, ans=0.125 2023-06-26 05:05:43,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1497282.0, ans=0.125 2023-06-26 05:05:43,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1497282.0, ans=0.125 2023-06-26 05:05:43,979 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.52 vs. 
limit=15.0 2023-06-26 05:05:45,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1497282.0, ans=0.0 2023-06-26 05:05:54,099 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:05:58,691 INFO [train.py:996] (2/4) Epoch 9, batch 5600, loss[loss=0.2117, simple_loss=0.2968, pruned_loss=0.06326, over 20119.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2973, pruned_loss=0.06143, over 4272217.80 frames. ], batch size: 702, lr: 3.35e-03, grad_scale: 32.0 2023-06-26 05:06:41,834 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.84 vs. limit=6.0 2023-06-26 05:06:43,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1497462.0, ans=10.0 2023-06-26 05:06:48,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1497462.0, ans=0.2 2023-06-26 05:06:53,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1497462.0, ans=0.1 2023-06-26 05:07:15,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1497522.0, ans=0.125 2023-06-26 05:07:45,550 INFO [train.py:996] (2/4) Epoch 9, batch 5650, loss[loss=0.211, simple_loss=0.2994, pruned_loss=0.06126, over 21256.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2988, pruned_loss=0.06334, over 4274889.40 frames. ], batch size: 176, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:08:29,188 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.963e+02 5.175e+02 8.774e+02 1.262e+03 2.376e+03, threshold=1.755e+03, percent-clipped=8.0 2023-06-26 05:09:41,577 INFO [train.py:996] (2/4) Epoch 9, batch 5700, loss[loss=0.256, simple_loss=0.3734, pruned_loss=0.06933, over 21203.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2978, pruned_loss=0.06473, over 4279725.24 frames. ], batch size: 548, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:10:19,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1498002.0, ans=0.125 2023-06-26 05:11:01,969 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=15.0 2023-06-26 05:11:20,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1498182.0, ans=0.125 2023-06-26 05:11:23,512 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.34 vs. limit=15.0 2023-06-26 05:11:39,529 INFO [train.py:996] (2/4) Epoch 9, batch 5750, loss[loss=0.1869, simple_loss=0.2814, pruned_loss=0.04614, over 21691.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.293, pruned_loss=0.06283, over 4274580.84 frames. 
], batch size: 351, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:12:12,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1498302.0, ans=0.0 2023-06-26 05:12:18,973 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.313e+02 4.582e+02 6.982e+02 1.089e+03 2.466e+03, threshold=1.396e+03, percent-clipped=2.0 2023-06-26 05:12:19,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1498362.0, ans=0.125 2023-06-26 05:12:22,332 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=15.0 2023-06-26 05:12:55,316 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=15.0 2023-06-26 05:13:10,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1498482.0, ans=0.2 2023-06-26 05:13:26,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1498482.0, ans=0.125 2023-06-26 05:13:31,281 INFO [train.py:996] (2/4) Epoch 9, batch 5800, loss[loss=0.24, simple_loss=0.3401, pruned_loss=0.06997, over 21739.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2926, pruned_loss=0.06209, over 4264659.10 frames. ], batch size: 351, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:13:33,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1498542.0, ans=0.1 2023-06-26 05:13:53,909 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.18 vs. limit=15.0 2023-06-26 05:15:08,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1498782.0, ans=0.0 2023-06-26 05:15:27,947 INFO [train.py:996] (2/4) Epoch 9, batch 5850, loss[loss=0.1531, simple_loss=0.2453, pruned_loss=0.03043, over 21421.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2919, pruned_loss=0.05879, over 4265895.23 frames. ], batch size: 131, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:15:56,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1498902.0, ans=0.125 2023-06-26 05:16:05,870 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.859e+02 4.519e+02 6.797e+02 9.504e+02 2.240e+03, threshold=1.359e+03, percent-clipped=6.0 2023-06-26 05:17:15,175 INFO [train.py:996] (2/4) Epoch 9, batch 5900, loss[loss=0.1751, simple_loss=0.2577, pruned_loss=0.04619, over 21704.00 frames. ], tot_loss[loss=0.1967, simple_loss=0.2847, pruned_loss=0.05435, over 4268192.70 frames. 
], batch size: 230, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:17:53,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1499202.0, ans=0.125 2023-06-26 05:18:02,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1499262.0, ans=0.125 2023-06-26 05:18:05,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1499262.0, ans=0.125 2023-06-26 05:18:14,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1499262.0, ans=0.125 2023-06-26 05:19:04,246 INFO [train.py:996] (2/4) Epoch 9, batch 5950, loss[loss=0.2102, simple_loss=0.3259, pruned_loss=0.04724, over 19760.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2849, pruned_loss=0.05759, over 4271125.41 frames. ], batch size: 703, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:19:25,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1499502.0, ans=0.125 2023-06-26 05:19:43,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1499502.0, ans=0.09899494936611666 2023-06-26 05:19:47,181 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.880e+02 4.429e+02 6.642e+02 9.511e+02 2.071e+03, threshold=1.328e+03, percent-clipped=8.0 2023-06-26 05:20:50,645 INFO [train.py:996] (2/4) Epoch 9, batch 6000, loss[loss=0.2161, simple_loss=0.2704, pruned_loss=0.08089, over 21451.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2818, pruned_loss=0.06052, over 4276511.90 frames. ], batch size: 441, lr: 3.35e-03, grad_scale: 32.0 2023-06-26 05:20:50,646 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-26 05:21:11,490 INFO [train.py:1028] (2/4) Epoch 9, validation: loss=0.2616, simple_loss=0.3531, pruned_loss=0.08508, over 1796401.00 frames. 2023-06-26 05:21:11,491 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-26 05:21:19,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1499742.0, ans=0.0 2023-06-26 05:22:11,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1499862.0, ans=0.125 2023-06-26 05:22:22,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1499922.0, ans=0.0 2023-06-26 05:23:08,662 INFO [train.py:996] (2/4) Epoch 9, batch 6050, loss[loss=0.1939, simple_loss=0.2577, pruned_loss=0.06502, over 21279.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2777, pruned_loss=0.06169, over 4271805.92 frames. ], batch size: 176, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:23:27,266 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.77 vs. limit=6.0 2023-06-26 05:23:48,267 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.995e+02 4.915e+02 7.181e+02 1.064e+03 2.049e+03, threshold=1.436e+03, percent-clipped=12.0 2023-06-26 05:24:26,047 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.99 vs. 
limit=15.0 2023-06-26 05:24:55,885 INFO [train.py:996] (2/4) Epoch 9, batch 6100, loss[loss=0.2031, simple_loss=0.281, pruned_loss=0.06263, over 21823.00 frames. ], tot_loss[loss=0.1984, simple_loss=0.2757, pruned_loss=0.06057, over 4275461.04 frames. ], batch size: 282, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:25:26,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1500402.0, ans=0.125 2023-06-26 05:25:29,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1500402.0, ans=0.0 2023-06-26 05:25:43,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1500462.0, ans=0.125 2023-06-26 05:25:52,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1500462.0, ans=0.125 2023-06-26 05:25:57,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1500522.0, ans=0.125 2023-06-26 05:26:29,236 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-26 05:26:43,522 INFO [train.py:996] (2/4) Epoch 9, batch 6150, loss[loss=0.1949, simple_loss=0.2746, pruned_loss=0.05755, over 21483.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.278, pruned_loss=0.06281, over 4286354.19 frames. ], batch size: 212, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:27:23,508 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.389e+02 4.733e+02 6.899e+02 9.489e+02 3.075e+03, threshold=1.380e+03, percent-clipped=10.0 2023-06-26 05:27:26,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1500762.0, ans=0.125 2023-06-26 05:28:32,076 INFO [train.py:996] (2/4) Epoch 9, batch 6200, loss[loss=0.1791, simple_loss=0.2509, pruned_loss=0.0536, over 21187.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2824, pruned_loss=0.064, over 4283108.92 frames. ], batch size: 143, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:28:50,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1501002.0, ans=0.125 2023-06-26 05:29:10,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1501002.0, ans=0.125 2023-06-26 05:29:19,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1501062.0, ans=0.07 2023-06-26 05:29:32,772 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.06 vs. limit=6.0 2023-06-26 05:29:47,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1501122.0, ans=0.0 2023-06-26 05:29:50,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1501122.0, ans=0.125 2023-06-26 05:29:51,485 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. 
limit=15.0 2023-06-26 05:30:01,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1501182.0, ans=0.125 2023-06-26 05:30:20,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1501242.0, ans=0.125 2023-06-26 05:30:21,402 INFO [train.py:996] (2/4) Epoch 9, batch 6250, loss[loss=0.2029, simple_loss=0.2659, pruned_loss=0.06995, over 21212.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.286, pruned_loss=0.06342, over 4279926.07 frames. ], batch size: 608, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:30:24,433 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=22.5 2023-06-26 05:30:58,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=1501302.0, ans=0.2 2023-06-26 05:31:01,153 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.531e+02 5.718e+02 9.151e+02 1.565e+03 3.193e+03, threshold=1.830e+03, percent-clipped=32.0 2023-06-26 05:31:22,803 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.76 vs. limit=15.0 2023-06-26 05:31:29,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1501422.0, ans=0.0 2023-06-26 05:32:09,877 INFO [train.py:996] (2/4) Epoch 9, batch 6300, loss[loss=0.2118, simple_loss=0.2872, pruned_loss=0.06815, over 21730.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2895, pruned_loss=0.06268, over 4277299.74 frames. ], batch size: 389, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:32:29,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1501542.0, ans=0.0 2023-06-26 05:32:58,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1501662.0, ans=0.125 2023-06-26 05:33:38,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1501722.0, ans=0.0 2023-06-26 05:33:40,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1501782.0, ans=0.04949747468305833 2023-06-26 05:34:00,228 INFO [train.py:996] (2/4) Epoch 9, batch 6350, loss[loss=0.243, simple_loss=0.3017, pruned_loss=0.09213, over 21455.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.293, pruned_loss=0.06719, over 4280102.73 frames. ], batch size: 548, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:34:25,803 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.46 vs. 
limit=15.0 2023-06-26 05:34:41,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1501902.0, ans=0.0 2023-06-26 05:34:52,168 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.804e+02 5.467e+02 7.732e+02 1.098e+03 2.787e+03, threshold=1.546e+03, percent-clipped=5.0 2023-06-26 05:35:20,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1502022.0, ans=0.125 2023-06-26 05:35:28,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1502022.0, ans=0.0 2023-06-26 05:35:55,812 INFO [train.py:996] (2/4) Epoch 9, batch 6400, loss[loss=0.2693, simple_loss=0.3521, pruned_loss=0.0932, over 21542.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.299, pruned_loss=0.07098, over 4271542.85 frames. ], batch size: 414, lr: 3.34e-03, grad_scale: 32.0 2023-06-26 05:36:32,065 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.13 vs. limit=15.0 2023-06-26 05:36:42,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1502262.0, ans=0.05 2023-06-26 05:36:53,091 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=15.0 2023-06-26 05:37:21,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1502382.0, ans=0.2 2023-06-26 05:37:45,711 INFO [train.py:996] (2/4) Epoch 9, batch 6450, loss[loss=0.2061, simple_loss=0.2831, pruned_loss=0.06454, over 21826.00 frames. ], tot_loss[loss=0.22, simple_loss=0.3003, pruned_loss=0.06981, over 4278133.29 frames. ], batch size: 371, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:37:47,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1502442.0, ans=0.2 2023-06-26 05:38:33,809 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.613e+02 5.276e+02 6.947e+02 1.153e+03 2.587e+03, threshold=1.389e+03, percent-clipped=9.0 2023-06-26 05:39:21,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1502682.0, ans=0.125 2023-06-26 05:39:26,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1502682.0, ans=0.125 2023-06-26 05:39:35,584 INFO [train.py:996] (2/4) Epoch 9, batch 6500, loss[loss=0.1963, simple_loss=0.2694, pruned_loss=0.06163, over 21398.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2935, pruned_loss=0.0684, over 4282545.23 frames. ], batch size: 194, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:39:51,038 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. 
limit=15.0 2023-06-26 05:40:34,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1502862.0, ans=0.0 2023-06-26 05:40:47,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1502922.0, ans=0.0 2023-06-26 05:41:17,279 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-26 05:41:30,790 INFO [train.py:996] (2/4) Epoch 9, batch 6550, loss[loss=0.2051, simple_loss=0.2959, pruned_loss=0.05718, over 21609.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2952, pruned_loss=0.06791, over 4278975.59 frames. ], batch size: 263, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:42:19,574 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.317e+02 4.890e+02 6.578e+02 1.052e+03 2.225e+03, threshold=1.316e+03, percent-clipped=12.0 2023-06-26 05:42:32,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1503222.0, ans=0.2 2023-06-26 05:43:12,660 INFO [train.py:996] (2/4) Epoch 9, batch 6600, loss[loss=0.1741, simple_loss=0.2458, pruned_loss=0.0512, over 21744.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2884, pruned_loss=0.0671, over 4285939.95 frames. ], batch size: 112, lr: 3.34e-03, grad_scale: 8.0 2023-06-26 05:43:59,784 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:44:15,794 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.07 vs. limit=15.0 2023-06-26 05:44:38,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1503522.0, ans=0.125 2023-06-26 05:45:04,869 INFO [train.py:996] (2/4) Epoch 9, batch 6650, loss[loss=0.1756, simple_loss=0.2601, pruned_loss=0.04561, over 21798.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2826, pruned_loss=0.06434, over 4276702.94 frames. ], batch size: 352, lr: 3.34e-03, grad_scale: 8.0 2023-06-26 05:45:35,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1503702.0, ans=0.1 2023-06-26 05:45:50,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1503762.0, ans=0.1 2023-06-26 05:45:53,450 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.932e+02 4.741e+02 6.331e+02 9.151e+02 2.148e+03, threshold=1.266e+03, percent-clipped=9.0 2023-06-26 05:46:30,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1503882.0, ans=10.0 2023-06-26 05:46:51,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1503882.0, ans=10.0 2023-06-26 05:46:54,052 INFO [train.py:996] (2/4) Epoch 9, batch 6700, loss[loss=0.2209, simple_loss=0.2901, pruned_loss=0.07586, over 21636.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2776, pruned_loss=0.0634, over 4270000.08 frames. 
], batch size: 415, lr: 3.34e-03, grad_scale: 8.0 2023-06-26 05:47:46,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1504062.0, ans=0.125 2023-06-26 05:48:36,344 INFO [train.py:996] (2/4) Epoch 9, batch 6750, loss[loss=0.2236, simple_loss=0.2874, pruned_loss=0.07986, over 21333.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2762, pruned_loss=0.0639, over 4273041.93 frames. ], batch size: 159, lr: 3.34e-03, grad_scale: 8.0 2023-06-26 05:48:53,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1504242.0, ans=0.0 2023-06-26 05:49:21,854 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-26 05:49:31,083 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.439e+02 4.588e+02 6.610e+02 8.394e+02 1.640e+03, threshold=1.322e+03, percent-clipped=2.0 2023-06-26 05:50:04,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1504422.0, ans=0.125 2023-06-26 05:50:23,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1504482.0, ans=0.125 2023-06-26 05:50:25,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1504482.0, ans=0.125 2023-06-26 05:50:29,565 INFO [train.py:996] (2/4) Epoch 9, batch 6800, loss[loss=0.2175, simple_loss=0.2768, pruned_loss=0.07908, over 21494.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2796, pruned_loss=0.06573, over 4260998.37 frames. ], batch size: 389, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:50:32,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1504542.0, ans=0.0 2023-06-26 05:50:37,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=1504542.0, ans=0.2 2023-06-26 05:50:42,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1504542.0, ans=0.125 2023-06-26 05:50:44,225 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:50:58,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1504602.0, ans=0.125 2023-06-26 05:51:13,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1504662.0, ans=0.2 2023-06-26 05:51:35,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1504722.0, ans=0.125 2023-06-26 05:51:43,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1504722.0, ans=0.125 2023-06-26 05:52:16,596 INFO [train.py:996] (2/4) Epoch 9, batch 6850, loss[loss=0.2286, simple_loss=0.2909, pruned_loss=0.08319, over 21764.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2778, pruned_loss=0.06679, over 4273895.92 frames. 
], batch size: 441, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:52:39,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1504902.0, ans=0.2 2023-06-26 05:52:40,303 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=12.0 2023-06-26 05:53:05,513 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.496e+02 4.685e+02 7.280e+02 1.216e+03 2.418e+03, threshold=1.456e+03, percent-clipped=17.0 2023-06-26 05:53:39,549 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:54:02,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1505082.0, ans=0.1 2023-06-26 05:54:05,611 INFO [train.py:996] (2/4) Epoch 9, batch 6900, loss[loss=0.2088, simple_loss=0.2766, pruned_loss=0.07045, over 21874.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.278, pruned_loss=0.06575, over 4279289.79 frames. ], batch size: 351, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:54:48,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1505202.0, ans=0.0 2023-06-26 05:55:02,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1505262.0, ans=0.1 2023-06-26 05:55:19,132 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.99 vs. limit=15.0 2023-06-26 05:55:40,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1505382.0, ans=0.05 2023-06-26 05:55:40,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1505382.0, ans=0.1 2023-06-26 05:55:54,130 INFO [train.py:996] (2/4) Epoch 9, batch 6950, loss[loss=0.206, simple_loss=0.2838, pruned_loss=0.06411, over 21626.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2806, pruned_loss=0.06342, over 4282918.09 frames. ], batch size: 263, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:56:43,258 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.242e+02 5.032e+02 6.537e+02 9.718e+02 2.265e+03, threshold=1.307e+03, percent-clipped=8.0 2023-06-26 05:57:42,917 INFO [train.py:996] (2/4) Epoch 9, batch 7000, loss[loss=0.2068, simple_loss=0.2639, pruned_loss=0.07491, over 21310.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2823, pruned_loss=0.06488, over 4279567.08 frames. ], batch size: 159, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:58:22,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1505802.0, ans=0.1 2023-06-26 05:58:48,107 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.03 vs. limit=15.0 2023-06-26 05:58:49,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1505862.0, ans=0.1 2023-06-26 05:59:38,666 INFO [train.py:996] (2/4) Epoch 9, batch 7050, loss[loss=0.1941, simple_loss=0.268, pruned_loss=0.06012, over 21159.00 frames. 
], tot_loss[loss=0.2039, simple_loss=0.28, pruned_loss=0.06387, over 4276944.81 frames. ], batch size: 607, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:59:39,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1506042.0, ans=0.2 2023-06-26 05:59:50,679 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=22.5 2023-06-26 05:59:55,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1506042.0, ans=10.0 2023-06-26 06:00:24,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1506162.0, ans=0.0 2023-06-26 06:00:27,418 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.188e+02 4.829e+02 6.611e+02 8.594e+02 1.864e+03, threshold=1.322e+03, percent-clipped=11.0 2023-06-26 06:01:33,083 INFO [train.py:996] (2/4) Epoch 9, batch 7100, loss[loss=0.1788, simple_loss=0.264, pruned_loss=0.04678, over 21681.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2844, pruned_loss=0.06536, over 4272521.00 frames. ], batch size: 298, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:01:42,857 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-26 06:01:58,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1506402.0, ans=0.0 2023-06-26 06:02:17,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1506462.0, ans=0.125 2023-06-26 06:03:22,332 INFO [train.py:996] (2/4) Epoch 9, batch 7150, loss[loss=0.1389, simple_loss=0.2121, pruned_loss=0.0328, over 21282.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2821, pruned_loss=0.06327, over 4275897.46 frames. ], batch size: 176, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:03:50,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1506702.0, ans=0.125 2023-06-26 06:04:05,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1506762.0, ans=0.025 2023-06-26 06:04:06,077 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.994e+02 4.588e+02 6.424e+02 8.469e+02 2.110e+03, threshold=1.285e+03, percent-clipped=2.0 2023-06-26 06:05:00,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1506882.0, ans=0.125 2023-06-26 06:05:11,696 INFO [train.py:996] (2/4) Epoch 9, batch 7200, loss[loss=0.2166, simple_loss=0.2772, pruned_loss=0.07795, over 21275.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2851, pruned_loss=0.06636, over 4281326.64 frames. 
], batch size: 159, lr: 3.34e-03, grad_scale: 32.0 2023-06-26 06:05:29,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1506942.0, ans=0.125 2023-06-26 06:05:51,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1507062.0, ans=0.0 2023-06-26 06:06:43,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1507182.0, ans=0.125 2023-06-26 06:06:49,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1507182.0, ans=0.125 2023-06-26 06:06:59,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1507242.0, ans=0.1 2023-06-26 06:07:00,471 INFO [train.py:996] (2/4) Epoch 9, batch 7250, loss[loss=0.1878, simple_loss=0.2525, pruned_loss=0.0615, over 21586.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2813, pruned_loss=0.06712, over 4287074.32 frames. ], batch size: 298, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:07:45,481 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.193e+02 5.249e+02 7.377e+02 1.151e+03 2.707e+03, threshold=1.475e+03, percent-clipped=23.0 2023-06-26 06:07:48,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1507362.0, ans=0.2 2023-06-26 06:08:36,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1507482.0, ans=0.1 2023-06-26 06:08:47,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1507542.0, ans=0.0 2023-06-26 06:08:48,800 INFO [train.py:996] (2/4) Epoch 9, batch 7300, loss[loss=0.1945, simple_loss=0.2609, pruned_loss=0.06407, over 21669.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2761, pruned_loss=0.06679, over 4285002.57 frames. ], batch size: 333, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:10:28,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1507782.0, ans=0.0 2023-06-26 06:10:44,113 INFO [train.py:996] (2/4) Epoch 9, batch 7350, loss[loss=0.2327, simple_loss=0.3002, pruned_loss=0.08258, over 21455.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2758, pruned_loss=0.06762, over 4281929.56 frames. ], batch size: 194, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:10:56,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1507842.0, ans=0.2 2023-06-26 06:10:56,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1507842.0, ans=0.035 2023-06-26 06:11:30,329 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.284e+02 4.727e+02 6.627e+02 9.690e+02 1.819e+03, threshold=1.325e+03, percent-clipped=8.0 2023-06-26 06:12:03,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1508022.0, ans=0.0 2023-06-26 06:12:34,204 INFO [train.py:996] (2/4) Epoch 9, batch 7400, loss[loss=0.1943, simple_loss=0.2764, pruned_loss=0.05613, over 21560.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2823, pruned_loss=0.06938, over 4280484.94 frames. 
], batch size: 230, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:12:49,726 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=22.5 2023-06-26 06:13:52,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1508322.0, ans=0.125 2023-06-26 06:14:22,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1508382.0, ans=0.125 2023-06-26 06:14:25,307 INFO [train.py:996] (2/4) Epoch 9, batch 7450, loss[loss=0.1938, simple_loss=0.262, pruned_loss=0.06281, over 15529.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2808, pruned_loss=0.06886, over 4259809.38 frames. ], batch size: 62, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:14:51,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1508502.0, ans=0.125 2023-06-26 06:14:52,394 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.63 vs. limit=10.0 2023-06-26 06:15:23,589 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.343e+02 4.976e+02 6.577e+02 1.050e+03 2.324e+03, threshold=1.315e+03, percent-clipped=17.0 2023-06-26 06:15:26,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1508562.0, ans=0.125 2023-06-26 06:15:30,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1508562.0, ans=0.125 2023-06-26 06:15:39,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1508622.0, ans=0.95 2023-06-26 06:16:00,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1508682.0, ans=0.1 2023-06-26 06:16:18,126 INFO [train.py:996] (2/4) Epoch 9, batch 7500, loss[loss=0.2431, simple_loss=0.35, pruned_loss=0.0681, over 21915.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2884, pruned_loss=0.07016, over 4266622.53 frames. ], batch size: 372, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:18:08,919 INFO [train.py:996] (2/4) Epoch 9, batch 7550, loss[loss=0.2065, simple_loss=0.3077, pruned_loss=0.05263, over 21656.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2947, pruned_loss=0.06919, over 4269258.87 frames. ], batch size: 414, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:18:25,996 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-26 06:19:04,856 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.393e+02 6.002e+02 8.588e+02 1.350e+03 2.877e+03, threshold=1.718e+03, percent-clipped=25.0 2023-06-26 06:19:56,676 INFO [train.py:996] (2/4) Epoch 9, batch 7600, loss[loss=0.225, simple_loss=0.2922, pruned_loss=0.07894, over 21327.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2944, pruned_loss=0.06818, over 4275122.05 frames. ], batch size: 159, lr: 3.33e-03, grad_scale: 32.0 2023-06-26 06:21:46,203 INFO [train.py:996] (2/4) Epoch 9, batch 7650, loss[loss=0.2348, simple_loss=0.3097, pruned_loss=0.07998, over 21878.00 frames. 
], tot_loss[loss=0.2158, simple_loss=0.2928, pruned_loss=0.06941, over 4288938.48 frames. ], batch size: 118, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:21:54,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1509642.0, ans=0.0 2023-06-26 06:21:57,915 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.25 vs. limit=22.5 2023-06-26 06:22:06,377 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.68 vs. limit=10.0 2023-06-26 06:22:43,951 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.446e+02 5.104e+02 7.952e+02 1.146e+03 1.972e+03, threshold=1.590e+03, percent-clipped=6.0 2023-06-26 06:23:19,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1509882.0, ans=15.0 2023-06-26 06:23:41,273 INFO [train.py:996] (2/4) Epoch 9, batch 7700, loss[loss=0.225, simple_loss=0.2986, pruned_loss=0.07566, over 21782.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2949, pruned_loss=0.07177, over 4293838.60 frames. ], batch size: 247, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:24:04,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1510002.0, ans=0.07 2023-06-26 06:24:43,458 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.85 vs. limit=15.0 2023-06-26 06:25:21,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1510182.0, ans=0.125 2023-06-26 06:25:33,231 INFO [train.py:996] (2/4) Epoch 9, batch 7750, loss[loss=0.3243, simple_loss=0.4173, pruned_loss=0.1156, over 21510.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2999, pruned_loss=0.07174, over 4288178.95 frames. ], batch size: 471, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:26:04,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1510302.0, ans=0.125 2023-06-26 06:26:32,287 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.490e+02 5.408e+02 8.578e+02 1.362e+03 2.742e+03, threshold=1.716e+03, percent-clipped=14.0 2023-06-26 06:26:59,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1510422.0, ans=0.125 2023-06-26 06:27:34,361 INFO [train.py:996] (2/4) Epoch 9, batch 7800, loss[loss=0.1724, simple_loss=0.2029, pruned_loss=0.07096, over 16618.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3018, pruned_loss=0.07242, over 4273187.87 frames. ], batch size: 61, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:28:18,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1510662.0, ans=0.2 2023-06-26 06:28:38,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1510722.0, ans=0.125 2023-06-26 06:29:23,907 INFO [train.py:996] (2/4) Epoch 9, batch 7850, loss[loss=0.1812, simple_loss=0.2415, pruned_loss=0.06049, over 21333.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2955, pruned_loss=0.07129, over 4264496.67 frames. 
], batch size: 177, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:29:24,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1510842.0, ans=0.035 2023-06-26 06:29:32,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1510842.0, ans=0.125 2023-06-26 06:29:38,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1510842.0, ans=0.125 2023-06-26 06:30:12,958 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.204e+02 4.902e+02 7.462e+02 1.114e+03 2.139e+03, threshold=1.492e+03, percent-clipped=5.0 2023-06-26 06:30:36,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1511022.0, ans=0.025 2023-06-26 06:31:15,010 INFO [train.py:996] (2/4) Epoch 9, batch 7900, loss[loss=0.2271, simple_loss=0.33, pruned_loss=0.06208, over 21777.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2919, pruned_loss=0.07067, over 4260294.04 frames. ], batch size: 351, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:31:24,932 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=22.5 2023-06-26 06:31:24,935 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.49 vs. limit=10.0 2023-06-26 06:31:39,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1511202.0, ans=0.1 2023-06-26 06:32:13,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1511322.0, ans=0.125 2023-06-26 06:32:43,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1511382.0, ans=0.0 2023-06-26 06:32:43,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1511382.0, ans=0.0 2023-06-26 06:32:50,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1511382.0, ans=0.0 2023-06-26 06:33:07,273 INFO [train.py:996] (2/4) Epoch 9, batch 7950, loss[loss=0.2302, simple_loss=0.3212, pruned_loss=0.0696, over 21649.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2946, pruned_loss=0.0704, over 4253447.59 frames. ], batch size: 389, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:33:32,011 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.74 vs. limit=6.0 2023-06-26 06:33:43,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1511502.0, ans=0.0 2023-06-26 06:34:02,991 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.829e+02 6.422e+02 9.281e+02 1.330e+03 3.368e+03, threshold=1.856e+03, percent-clipped=18.0 2023-06-26 06:35:05,386 INFO [train.py:996] (2/4) Epoch 9, batch 8000, loss[loss=0.3092, simple_loss=0.3871, pruned_loss=0.1157, over 21432.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2998, pruned_loss=0.07291, over 4257489.51 frames. 
], batch size: 507, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:35:08,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1511742.0, ans=0.0 2023-06-26 06:35:23,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1511802.0, ans=0.125 2023-06-26 06:35:28,237 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.17 vs. limit=15.0 2023-06-26 06:36:56,988 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=22.5 2023-06-26 06:37:01,466 INFO [train.py:996] (2/4) Epoch 9, batch 8050, loss[loss=0.3068, simple_loss=0.3905, pruned_loss=0.1115, over 21504.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3052, pruned_loss=0.0738, over 4257471.91 frames. ], batch size: 471, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:37:26,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1512102.0, ans=0.125 2023-06-26 06:38:01,842 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.472e+02 6.276e+02 8.546e+02 1.348e+03 3.651e+03, threshold=1.709e+03, percent-clipped=15.0 2023-06-26 06:38:51,612 INFO [train.py:996] (2/4) Epoch 9, batch 8100, loss[loss=0.2412, simple_loss=0.3219, pruned_loss=0.08023, over 21903.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3025, pruned_loss=0.07358, over 4264098.35 frames. ], batch size: 118, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:39:44,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1512462.0, ans=0.125 2023-06-26 06:39:53,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1512462.0, ans=0.2 2023-06-26 06:40:04,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1512462.0, ans=0.125 2023-06-26 06:40:58,167 INFO [train.py:996] (2/4) Epoch 9, batch 8150, loss[loss=0.2154, simple_loss=0.3097, pruned_loss=0.06056, over 21757.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3109, pruned_loss=0.07536, over 4260922.52 frames. ], batch size: 332, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:41:46,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1512762.0, ans=0.1 2023-06-26 06:41:53,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1512762.0, ans=0.1 2023-06-26 06:41:54,139 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.538e+02 6.819e+02 1.034e+03 1.568e+03 4.387e+03, threshold=2.069e+03, percent-clipped=18.0 2023-06-26 06:42:09,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1512822.0, ans=0.125 2023-06-26 06:42:13,312 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.58 vs. 
limit=15.0 2023-06-26 06:42:24,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1512882.0, ans=0.04949747468305833 2023-06-26 06:42:35,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1512882.0, ans=0.125 2023-06-26 06:42:41,755 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-26 06:42:49,112 INFO [train.py:996] (2/4) Epoch 9, batch 8200, loss[loss=0.1731, simple_loss=0.2398, pruned_loss=0.05317, over 21194.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3023, pruned_loss=0.0722, over 4266041.67 frames. ], batch size: 159, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:43:12,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1513002.0, ans=0.125 2023-06-26 06:43:35,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1513062.0, ans=0.125 2023-06-26 06:43:55,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1513122.0, ans=0.0 2023-06-26 06:44:40,621 INFO [train.py:996] (2/4) Epoch 9, batch 8250, loss[loss=0.2072, simple_loss=0.2999, pruned_loss=0.05726, over 21784.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2986, pruned_loss=0.07095, over 4261692.57 frames. ], batch size: 282, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:44:56,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1513242.0, ans=0.0 2023-06-26 06:45:12,184 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-26 06:45:22,223 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=22.5 2023-06-26 06:45:23,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1513302.0, ans=0.1 2023-06-26 06:45:23,979 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=15.0 2023-06-26 06:45:25,767 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.50 vs. limit=22.5 2023-06-26 06:45:27,513 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.74 vs. limit=15.0 2023-06-26 06:45:30,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1513362.0, ans=0.0 2023-06-26 06:45:36,655 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.396e+02 4.867e+02 7.289e+02 1.042e+03 1.970e+03, threshold=1.458e+03, percent-clipped=0.0 2023-06-26 06:46:35,315 INFO [train.py:996] (2/4) Epoch 9, batch 8300, loss[loss=0.212, simple_loss=0.2987, pruned_loss=0.06263, over 21720.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2967, pruned_loss=0.06801, over 4260327.39 frames. 
], batch size: 332, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:46:37,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1513542.0, ans=0.2 2023-06-26 06:46:49,949 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 06:46:59,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1513602.0, ans=0.0 2023-06-26 06:48:25,383 INFO [train.py:996] (2/4) Epoch 9, batch 8350, loss[loss=0.1699, simple_loss=0.2591, pruned_loss=0.04029, over 21369.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2951, pruned_loss=0.06634, over 4269629.02 frames. ], batch size: 131, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:49:22,437 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.156e+02 5.177e+02 7.489e+02 1.153e+03 2.858e+03, threshold=1.498e+03, percent-clipped=11.0 2023-06-26 06:49:35,882 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=12.0 2023-06-26 06:49:37,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1514022.0, ans=0.2 2023-06-26 06:50:14,409 INFO [train.py:996] (2/4) Epoch 9, batch 8400, loss[loss=0.1765, simple_loss=0.2707, pruned_loss=0.04117, over 21750.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2935, pruned_loss=0.06433, over 4274951.80 frames. ], batch size: 351, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:50:16,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1514142.0, ans=0.2 2023-06-26 06:50:22,703 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=15.0 2023-06-26 06:51:14,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1514262.0, ans=0.05 2023-06-26 06:51:15,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1514322.0, ans=0.0 2023-06-26 06:51:22,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1514322.0, ans=0.0 2023-06-26 06:51:26,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1514322.0, ans=0.125 2023-06-26 06:51:53,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1514382.0, ans=0.1 2023-06-26 06:52:01,975 INFO [train.py:996] (2/4) Epoch 9, batch 8450, loss[loss=0.2201, simple_loss=0.302, pruned_loss=0.06914, over 17099.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2921, pruned_loss=0.06447, over 4275604.08 frames. ], batch size: 60, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:52:03,076 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.31 vs. 
limit=22.5 2023-06-26 06:52:58,303 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.897e+02 4.182e+02 5.654e+02 7.712e+02 3.428e+03, threshold=1.131e+03, percent-clipped=11.0 2023-06-26 06:53:12,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1514622.0, ans=0.07 2023-06-26 06:53:23,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1514622.0, ans=0.1 2023-06-26 06:53:32,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1514682.0, ans=0.0 2023-06-26 06:53:51,632 INFO [train.py:996] (2/4) Epoch 9, batch 8500, loss[loss=0.2286, simple_loss=0.2952, pruned_loss=0.08098, over 21633.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2886, pruned_loss=0.06524, over 4276336.29 frames. ], batch size: 391, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:54:03,848 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=15.0 2023-06-26 06:54:07,763 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.94 vs. limit=15.0 2023-06-26 06:54:20,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1514802.0, ans=0.0 2023-06-26 06:54:23,250 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-26 06:55:14,639 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.63 vs. limit=10.0 2023-06-26 06:55:27,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1514982.0, ans=0.125 2023-06-26 06:55:42,994 INFO [train.py:996] (2/4) Epoch 9, batch 8550, loss[loss=0.2413, simple_loss=0.3317, pruned_loss=0.07548, over 21746.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2924, pruned_loss=0.06762, over 4273234.30 frames. ], batch size: 351, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:56:11,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1515102.0, ans=0.125 2023-06-26 06:56:15,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1515102.0, ans=0.0 2023-06-26 06:56:40,476 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.388e+02 5.673e+02 9.028e+02 1.285e+03 2.973e+03, threshold=1.806e+03, percent-clipped=33.0 2023-06-26 06:57:34,134 INFO [train.py:996] (2/4) Epoch 9, batch 8600, loss[loss=0.2315, simple_loss=0.3541, pruned_loss=0.05447, over 19834.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.3007, pruned_loss=0.06987, over 4261696.53 frames. 
], batch size: 702, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:58:44,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1515522.0, ans=0.0 2023-06-26 06:58:49,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1515522.0, ans=0.0 2023-06-26 06:59:23,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1515642.0, ans=0.125 2023-06-26 06:59:25,169 INFO [train.py:996] (2/4) Epoch 9, batch 8650, loss[loss=0.2089, simple_loss=0.3042, pruned_loss=0.05677, over 21643.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3067, pruned_loss=0.07005, over 4269670.62 frames. ], batch size: 441, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 07:00:17,781 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=12.0 2023-06-26 07:00:25,178 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.107e+02 4.849e+02 6.283e+02 8.957e+02 2.012e+03, threshold=1.257e+03, percent-clipped=3.0 2023-06-26 07:00:29,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1515762.0, ans=0.2 2023-06-26 07:00:40,401 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.98 vs. limit=15.0 2023-06-26 07:01:11,955 INFO [train.py:996] (2/4) Epoch 9, batch 8700, loss[loss=0.1913, simple_loss=0.2416, pruned_loss=0.07051, over 20267.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2984, pruned_loss=0.06759, over 4272365.03 frames. ], batch size: 703, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 07:02:02,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1516062.0, ans=0.125 2023-06-26 07:02:39,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1516182.0, ans=0.125 2023-06-26 07:02:53,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1516242.0, ans=0.2 2023-06-26 07:02:54,668 INFO [train.py:996] (2/4) Epoch 9, batch 8750, loss[loss=0.2399, simple_loss=0.3173, pruned_loss=0.08123, over 21467.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2945, pruned_loss=0.0687, over 4280004.69 frames. ], batch size: 548, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 07:03:35,642 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.40 vs. limit=15.0 2023-06-26 07:04:02,678 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.478e+02 4.871e+02 5.858e+02 9.020e+02 2.163e+03, threshold=1.172e+03, percent-clipped=9.0 2023-06-26 07:04:51,221 INFO [train.py:996] (2/4) Epoch 9, batch 8800, loss[loss=0.2527, simple_loss=0.3657, pruned_loss=0.06992, over 19818.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3016, pruned_loss=0.07083, over 4277742.77 frames. ], batch size: 702, lr: 3.33e-03, grad_scale: 32.0 2023-06-26 07:05:44,243 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.83 vs. 
limit=15.0 2023-06-26 07:05:45,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1516662.0, ans=0.0 2023-06-26 07:06:25,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1516782.0, ans=0.125 2023-06-26 07:06:33,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1516782.0, ans=0.125 2023-06-26 07:06:46,521 INFO [train.py:996] (2/4) Epoch 9, batch 8850, loss[loss=0.2475, simple_loss=0.3252, pruned_loss=0.08491, over 21683.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3092, pruned_loss=0.07197, over 4274040.55 frames. ], batch size: 441, lr: 3.33e-03, grad_scale: 32.0 2023-06-26 07:07:00,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1516842.0, ans=0.125 2023-06-26 07:07:43,097 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.458e+02 5.018e+02 7.490e+02 1.008e+03 2.036e+03, threshold=1.498e+03, percent-clipped=19.0 2023-06-26 07:07:56,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1517022.0, ans=0.125 2023-06-26 07:08:08,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1517022.0, ans=0.1 2023-06-26 07:08:37,008 INFO [train.py:996] (2/4) Epoch 9, batch 8900, loss[loss=0.1949, simple_loss=0.2715, pruned_loss=0.05918, over 21624.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3045, pruned_loss=0.07117, over 4270787.74 frames. ], batch size: 298, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 07:09:10,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1517202.0, ans=0.125 2023-06-26 07:10:02,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1517322.0, ans=0.125 2023-06-26 07:10:22,975 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.63 vs. limit=22.5 2023-06-26 07:10:34,130 INFO [train.py:996] (2/4) Epoch 9, batch 8950, loss[loss=0.2645, simple_loss=0.3635, pruned_loss=0.08273, over 21192.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3054, pruned_loss=0.0706, over 4273070.96 frames. ], batch size: 549, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 07:11:15,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1517562.0, ans=0.125 2023-06-26 07:11:15,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1517562.0, ans=0.125 2023-06-26 07:11:31,721 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.717e+02 6.385e+02 1.007e+03 1.831e+03 3.231e+03, threshold=2.014e+03, percent-clipped=34.0 2023-06-26 07:12:29,543 INFO [train.py:996] (2/4) Epoch 9, batch 9000, loss[loss=0.1984, simple_loss=0.2781, pruned_loss=0.05938, over 21670.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2986, pruned_loss=0.06954, over 4266310.03 frames. 
], batch size: 282, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 07:12:29,544 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-26 07:12:43,328 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.3.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.5370, 2.9842, 3.1032, 3.6075, 2.0844, 3.3601, 3.3486, 2.4780], device='cuda:2') 2023-06-26 07:12:47,776 INFO [train.py:1028] (2/4) Epoch 9, validation: loss=0.2687, simple_loss=0.357, pruned_loss=0.09027, over 1796401.00 frames. 2023-06-26 07:12:47,777 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-26 07:12:54,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1517742.0, ans=0.0 2023-06-26 07:13:13,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1517802.0, ans=0.125 2023-06-26 07:13:32,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1517862.0, ans=0.0 2023-06-26 07:13:32,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1517862.0, ans=0.125 2023-06-26 07:14:17,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1517922.0, ans=0.125 2023-06-26 07:14:21,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1517982.0, ans=0.0 2023-06-26 07:14:37,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1518042.0, ans=0.0 2023-06-26 07:14:38,786 INFO [train.py:996] (2/4) Epoch 9, batch 9050, loss[loss=0.2301, simple_loss=0.3073, pruned_loss=0.07647, over 21436.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2944, pruned_loss=0.06613, over 4268494.02 frames. ], batch size: 194, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 07:15:10,517 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=12.0 2023-06-26 07:15:17,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1518102.0, ans=0.2 2023-06-26 07:15:33,552 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:15:35,976 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. limit=6.0 2023-06-26 07:15:38,266 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.242e+02 4.774e+02 6.783e+02 1.195e+03 2.023e+03, threshold=1.357e+03, percent-clipped=1.0 2023-06-26 07:16:07,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1518222.0, ans=0.125 2023-06-26 07:16:18,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1518282.0, ans=0.1 2023-06-26 07:16:30,156 INFO [train.py:996] (2/4) Epoch 9, batch 9100, loss[loss=0.1962, simple_loss=0.2946, pruned_loss=0.04891, over 21642.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2985, pruned_loss=0.06802, over 4267251.72 frames. 
], batch size: 263, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:16:32,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1518342.0, ans=0.09899494936611666 2023-06-26 07:16:34,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1518342.0, ans=0.125 2023-06-26 07:17:21,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1518462.0, ans=0.2 2023-06-26 07:18:11,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1518582.0, ans=0.125 2023-06-26 07:18:20,688 INFO [train.py:996] (2/4) Epoch 9, batch 9150, loss[loss=0.2116, simple_loss=0.2993, pruned_loss=0.06198, over 21638.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.3003, pruned_loss=0.06574, over 4267864.45 frames. ], batch size: 263, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:18:23,680 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-06-26 07:18:51,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1518702.0, ans=0.0 2023-06-26 07:18:52,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1518702.0, ans=0.0 2023-06-26 07:19:29,336 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.947e+02 4.682e+02 7.293e+02 9.875e+02 2.025e+03, threshold=1.459e+03, percent-clipped=11.0 2023-06-26 07:19:33,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1518822.0, ans=0.1 2023-06-26 07:20:14,569 INFO [train.py:996] (2/4) Epoch 9, batch 9200, loss[loss=0.2625, simple_loss=0.3355, pruned_loss=0.09474, over 21823.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.3017, pruned_loss=0.06572, over 4263980.75 frames. ], batch size: 124, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 07:20:24,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1518942.0, ans=0.035 2023-06-26 07:21:26,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1519122.0, ans=0.95 2023-06-26 07:21:28,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1519122.0, ans=0.125 2023-06-26 07:21:32,223 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=15.0 2023-06-26 07:21:56,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1519182.0, ans=0.2 2023-06-26 07:21:56,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1519182.0, ans=0.1 2023-06-26 07:22:03,208 INFO [train.py:996] (2/4) Epoch 9, batch 9250, loss[loss=0.1998, simple_loss=0.2669, pruned_loss=0.06637, over 21641.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.3044, pruned_loss=0.06858, over 4267313.47 frames. 
], batch size: 298, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 07:22:05,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1519242.0, ans=0.2 2023-06-26 07:22:12,764 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.63 vs. limit=15.0 2023-06-26 07:22:53,789 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.06 vs. limit=15.0 2023-06-26 07:23:02,590 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=22.5 2023-06-26 07:23:06,221 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=15.0 2023-06-26 07:23:06,661 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.432e+02 5.072e+02 7.125e+02 1.070e+03 2.650e+03, threshold=1.425e+03, percent-clipped=11.0 2023-06-26 07:23:16,955 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-26 07:23:48,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1519482.0, ans=0.0 2023-06-26 07:23:53,084 INFO [train.py:996] (2/4) Epoch 9, batch 9300, loss[loss=0.3089, simple_loss=0.3936, pruned_loss=0.1121, over 21478.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2996, pruned_loss=0.0688, over 4270612.57 frames. ], batch size: 471, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:25:05,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1519722.0, ans=0.125 2023-06-26 07:25:28,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1519782.0, ans=0.2 2023-06-26 07:25:43,699 INFO [train.py:996] (2/4) Epoch 9, batch 9350, loss[loss=0.2604, simple_loss=0.3389, pruned_loss=0.09092, over 21796.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3064, pruned_loss=0.07022, over 4278412.23 frames. ], batch size: 441, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:26:48,930 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.23 vs. limit=10.0 2023-06-26 07:26:54,936 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.694e+02 5.143e+02 7.806e+02 1.433e+03 2.856e+03, threshold=1.561e+03, percent-clipped=26.0 2023-06-26 07:27:02,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1520022.0, ans=0.1 2023-06-26 07:27:12,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1520022.0, ans=0.125 2023-06-26 07:27:38,724 INFO [train.py:996] (2/4) Epoch 9, batch 9400, loss[loss=0.1905, simple_loss=0.2574, pruned_loss=0.06175, over 21545.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3054, pruned_loss=0.07042, over 4279734.60 frames. 
], batch size: 263, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:28:24,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1520262.0, ans=0.125 2023-06-26 07:28:57,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1520322.0, ans=0.1 2023-06-26 07:29:00,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1520322.0, ans=0.0 2023-06-26 07:29:31,960 INFO [train.py:996] (2/4) Epoch 9, batch 9450, loss[loss=0.2663, simple_loss=0.4052, pruned_loss=0.06368, over 19811.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2986, pruned_loss=0.06968, over 4277558.51 frames. ], batch size: 702, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:30:31,557 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.480e+02 5.776e+02 8.947e+02 1.514e+03 4.644e+03, threshold=1.789e+03, percent-clipped=22.0 2023-06-26 07:31:21,074 INFO [train.py:996] (2/4) Epoch 9, batch 9500, loss[loss=0.1787, simple_loss=0.2644, pruned_loss=0.04652, over 21711.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2921, pruned_loss=0.06813, over 4262601.69 frames. ], batch size: 332, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:31:50,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1520802.0, ans=0.125 2023-06-26 07:33:12,883 INFO [train.py:996] (2/4) Epoch 9, batch 9550, loss[loss=0.2186, simple_loss=0.296, pruned_loss=0.0706, over 21547.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2967, pruned_loss=0.07115, over 4271111.46 frames. ], batch size: 263, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:33:29,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1521042.0, ans=0.0 2023-06-26 07:34:11,608 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.380e+02 4.672e+02 5.675e+02 8.285e+02 1.544e+03, threshold=1.135e+03, percent-clipped=0.0 2023-06-26 07:34:22,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1521222.0, ans=0.04949747468305833 2023-06-26 07:35:01,330 INFO [train.py:996] (2/4) Epoch 9, batch 9600, loss[loss=0.2, simple_loss=0.2758, pruned_loss=0.06212, over 21369.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2982, pruned_loss=0.07248, over 4275467.64 frames. ], batch size: 176, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 07:35:56,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1521462.0, ans=0.125 2023-06-26 07:35:58,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1521462.0, ans=0.0 2023-06-26 07:36:49,673 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:36:50,375 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.82 vs. limit=6.0 2023-06-26 07:36:52,877 INFO [train.py:996] (2/4) Epoch 9, batch 9650, loss[loss=0.2299, simple_loss=0.3111, pruned_loss=0.07436, over 21450.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2993, pruned_loss=0.07234, over 4280907.28 frames. 
], batch size: 211, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:37:11,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1521702.0, ans=0.0 2023-06-26 07:37:49,607 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.333e+02 4.623e+02 6.972e+02 1.187e+03 2.800e+03, threshold=1.394e+03, percent-clipped=26.0 2023-06-26 07:37:53,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1521822.0, ans=0.1 2023-06-26 07:37:57,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1521822.0, ans=0.125 2023-06-26 07:38:17,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1521882.0, ans=0.125 2023-06-26 07:38:22,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1521882.0, ans=0.125 2023-06-26 07:38:38,190 INFO [train.py:996] (2/4) Epoch 9, batch 9700, loss[loss=0.2411, simple_loss=0.3345, pruned_loss=0.07387, over 21748.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3003, pruned_loss=0.07244, over 4281309.78 frames. ], batch size: 351, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:38:46,520 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=15.0 2023-06-26 07:38:57,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1522002.0, ans=0.125 2023-06-26 07:39:08,055 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:39:36,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1522122.0, ans=0.125 2023-06-26 07:39:38,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1522122.0, ans=0.09899494936611666 2023-06-26 07:39:47,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1522122.0, ans=0.0 2023-06-26 07:39:59,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1522122.0, ans=0.0 2023-06-26 07:40:15,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1522182.0, ans=0.0 2023-06-26 07:40:27,215 INFO [train.py:996] (2/4) Epoch 9, batch 9750, loss[loss=0.1869, simple_loss=0.2548, pruned_loss=0.05946, over 21551.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2937, pruned_loss=0.071, over 4268082.50 frames. 
], batch size: 391, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:40:39,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1522242.0, ans=0.125 2023-06-26 07:40:53,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1522302.0, ans=0.1 2023-06-26 07:41:15,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1522362.0, ans=0.0 2023-06-26 07:41:22,046 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.554e+02 4.804e+02 6.885e+02 8.968e+02 2.424e+03, threshold=1.377e+03, percent-clipped=5.0 2023-06-26 07:41:32,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1522422.0, ans=0.2 2023-06-26 07:41:32,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1522422.0, ans=0.0 2023-06-26 07:42:07,402 INFO [train.py:996] (2/4) Epoch 9, batch 9800, loss[loss=0.1933, simple_loss=0.2718, pruned_loss=0.05746, over 21822.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2941, pruned_loss=0.07163, over 4252506.98 frames. ], batch size: 298, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:42:51,180 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.03 vs. limit=15.0 2023-06-26 07:43:11,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1522722.0, ans=0.125 2023-06-26 07:43:22,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1522722.0, ans=0.1 2023-06-26 07:43:57,299 INFO [train.py:996] (2/4) Epoch 9, batch 9850, loss[loss=0.2039, simple_loss=0.2721, pruned_loss=0.06787, over 21348.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2923, pruned_loss=0.07115, over 4252384.17 frames. ], batch size: 177, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:44:58,523 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.446e+02 4.827e+02 6.671e+02 1.006e+03 2.121e+03, threshold=1.334e+03, percent-clipped=9.0 2023-06-26 07:45:52,818 INFO [train.py:996] (2/4) Epoch 9, batch 9900, loss[loss=0.2123, simple_loss=0.292, pruned_loss=0.06632, over 21750.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2891, pruned_loss=0.07095, over 4256904.54 frames. ], batch size: 333, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:46:38,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1523262.0, ans=0.125 2023-06-26 07:46:52,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1523322.0, ans=0.2 2023-06-26 07:46:54,594 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.27 vs. 
limit=12.0 2023-06-26 07:47:20,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1523382.0, ans=0.125 2023-06-26 07:47:20,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1523382.0, ans=0.0 2023-06-26 07:47:35,629 INFO [train.py:996] (2/4) Epoch 9, batch 9950, loss[loss=0.213, simple_loss=0.2727, pruned_loss=0.07671, over 21568.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2904, pruned_loss=0.07197, over 4238753.28 frames. ], batch size: 415, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:47:53,638 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.97 vs. limit=12.0 2023-06-26 07:48:17,508 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=15.0 2023-06-26 07:48:20,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1523562.0, ans=0.125 2023-06-26 07:48:38,020 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.372e+02 4.966e+02 6.562e+02 9.646e+02 1.795e+03, threshold=1.312e+03, percent-clipped=7.0 2023-06-26 07:48:58,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1523622.0, ans=0.0 2023-06-26 07:49:16,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1523682.0, ans=0.1 2023-06-26 07:49:20,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1523682.0, ans=0.0 2023-06-26 07:49:31,820 INFO [train.py:996] (2/4) Epoch 9, batch 10000, loss[loss=0.17, simple_loss=0.222, pruned_loss=0.05899, over 20861.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2855, pruned_loss=0.07071, over 4247206.98 frames. ], batch size: 613, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 07:49:34,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1523742.0, ans=0.2 2023-06-26 07:50:17,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1523862.0, ans=10.0 2023-06-26 07:50:45,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1523922.0, ans=0.2 2023-06-26 07:50:49,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1523922.0, ans=0.125 2023-06-26 07:50:51,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1523922.0, ans=0.0 2023-06-26 07:51:22,430 INFO [train.py:996] (2/4) Epoch 9, batch 10050, loss[loss=0.1981, simple_loss=0.2696, pruned_loss=0.06333, over 21288.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2874, pruned_loss=0.07079, over 4252397.41 frames. 
], batch size: 549, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 07:52:31,218 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.556e+02 5.086e+02 7.732e+02 1.194e+03 2.294e+03, threshold=1.546e+03, percent-clipped=16.0 2023-06-26 07:53:08,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1524282.0, ans=0.125 2023-06-26 07:53:13,026 INFO [train.py:996] (2/4) Epoch 9, batch 10100, loss[loss=0.2135, simple_loss=0.2951, pruned_loss=0.066, over 21890.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2869, pruned_loss=0.06935, over 4259982.37 frames. ], batch size: 316, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 07:54:39,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1524522.0, ans=0.0 2023-06-26 07:55:01,510 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.38 vs. limit=12.0 2023-06-26 07:55:07,057 INFO [train.py:996] (2/4) Epoch 9, batch 10150, loss[loss=0.1987, simple_loss=0.2741, pruned_loss=0.06164, over 21252.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2936, pruned_loss=0.07221, over 4260146.64 frames. ], batch size: 159, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:55:54,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1524762.0, ans=0.125 2023-06-26 07:55:55,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1524762.0, ans=0.0 2023-06-26 07:56:04,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1524762.0, ans=0.0 2023-06-26 07:56:10,907 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.445e+02 5.423e+02 7.380e+02 1.011e+03 1.635e+03, threshold=1.476e+03, percent-clipped=1.0 2023-06-26 07:56:56,539 INFO [train.py:996] (2/4) Epoch 9, batch 10200, loss[loss=0.2221, simple_loss=0.3603, pruned_loss=0.04193, over 19771.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2918, pruned_loss=0.06978, over 4265680.97 frames. ], batch size: 702, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:57:43,681 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-26 07:57:51,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1525062.0, ans=0.125 2023-06-26 07:58:26,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1525182.0, ans=0.0 2023-06-26 07:58:47,141 INFO [train.py:996] (2/4) Epoch 9, batch 10250, loss[loss=0.2161, simple_loss=0.3013, pruned_loss=0.06546, over 21580.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2889, pruned_loss=0.06563, over 4258037.60 frames. ], batch size: 414, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:59:58,343 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.778e+02 4.201e+02 6.167e+02 1.103e+03 3.116e+03, threshold=1.233e+03, percent-clipped=15.0 2023-06-26 08:00:38,969 INFO [train.py:996] (2/4) Epoch 9, batch 10300, loss[loss=0.242, simple_loss=0.3178, pruned_loss=0.0831, over 21250.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.293, pruned_loss=0.06654, over 4260793.23 frames. 
], batch size: 159, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 08:00:47,158 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. limit=6.0 2023-06-26 08:01:20,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1525602.0, ans=0.2 2023-06-26 08:02:30,554 INFO [train.py:996] (2/4) Epoch 9, batch 10350, loss[loss=0.1931, simple_loss=0.273, pruned_loss=0.05662, over 21829.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2923, pruned_loss=0.06628, over 4257815.03 frames. ], batch size: 317, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 08:02:57,204 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.98 vs. limit=10.0 2023-06-26 08:02:58,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1525902.0, ans=0.0 2023-06-26 08:03:00,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1525902.0, ans=0.125 2023-06-26 08:03:46,312 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.376e+02 5.119e+02 7.830e+02 1.250e+03 2.539e+03, threshold=1.566e+03, percent-clipped=25.0 2023-06-26 08:04:33,036 INFO [train.py:996] (2/4) Epoch 9, batch 10400, loss[loss=0.1862, simple_loss=0.2563, pruned_loss=0.058, over 21672.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2874, pruned_loss=0.06593, over 4261625.57 frames. ], batch size: 263, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 08:04:38,138 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.22 vs. limit=12.0 2023-06-26 08:04:44,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1526142.0, ans=0.0 2023-06-26 08:04:49,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1526202.0, ans=0.1 2023-06-26 08:04:51,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=1526202.0, ans=0.1 2023-06-26 08:05:05,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1526202.0, ans=0.0 2023-06-26 08:05:12,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1526202.0, ans=0.1 2023-06-26 08:05:27,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1526262.0, ans=0.0 2023-06-26 08:06:24,923 INFO [train.py:996] (2/4) Epoch 9, batch 10450, loss[loss=0.2527, simple_loss=0.3355, pruned_loss=0.08498, over 21663.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2894, pruned_loss=0.06747, over 4251707.23 frames. ], batch size: 414, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 08:06:28,042 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=15.0 2023-06-26 08:07:17,000 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. 
limit=6.0 2023-06-26 08:07:29,716 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.606e+02 5.261e+02 7.908e+02 1.020e+03 2.061e+03, threshold=1.582e+03, percent-clipped=9.0 2023-06-26 08:07:52,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1526622.0, ans=0.125 2023-06-26 08:07:54,764 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-26 08:08:14,069 INFO [train.py:996] (2/4) Epoch 9, batch 10500, loss[loss=0.2016, simple_loss=0.271, pruned_loss=0.06606, over 21439.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2893, pruned_loss=0.06589, over 4254243.54 frames. ], batch size: 389, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 08:08:30,945 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0 2023-06-26 08:08:32,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1526742.0, ans=0.125 2023-06-26 08:09:27,251 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.19 vs. limit=10.0 2023-06-26 08:10:02,784 INFO [train.py:996] (2/4) Epoch 9, batch 10550, loss[loss=0.2027, simple_loss=0.2734, pruned_loss=0.06607, over 21854.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2839, pruned_loss=0.06547, over 4250464.36 frames. ], batch size: 107, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 08:10:08,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1527042.0, ans=0.0 2023-06-26 08:10:18,411 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.15 vs. limit=15.0 2023-06-26 08:10:42,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1527102.0, ans=0.0 2023-06-26 08:10:42,655 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:10:45,290 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=15.0 2023-06-26 08:11:04,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1527222.0, ans=0.1 2023-06-26 08:11:07,154 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.351e+02 4.011e+02 5.575e+02 6.702e+02 2.123e+03, threshold=1.115e+03, percent-clipped=3.0 2023-06-26 08:11:43,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1527282.0, ans=0.125 2023-06-26 08:11:43,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1527282.0, ans=0.05 2023-06-26 08:11:47,864 INFO [train.py:996] (2/4) Epoch 9, batch 10600, loss[loss=0.225, simple_loss=0.2861, pruned_loss=0.0819, over 20168.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2797, pruned_loss=0.06453, over 4253994.85 frames. 
], batch size: 707, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 08:12:02,078 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.59 vs. limit=22.5 2023-06-26 08:12:04,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1527342.0, ans=15.0 2023-06-26 08:12:35,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1527462.0, ans=0.0 2023-06-26 08:12:38,246 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.37 vs. limit=15.0 2023-06-26 08:12:51,640 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:13:09,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1527522.0, ans=0.0 2023-06-26 08:13:44,631 INFO [train.py:996] (2/4) Epoch 9, batch 10650, loss[loss=0.2579, simple_loss=0.38, pruned_loss=0.06787, over 19798.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2814, pruned_loss=0.0642, over 4239132.62 frames. ], batch size: 702, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:14:49,640 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.095e+02 4.930e+02 8.313e+02 1.262e+03 3.074e+03, threshold=1.663e+03, percent-clipped=34.0 2023-06-26 08:14:59,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1527822.0, ans=0.125 2023-06-26 08:14:59,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1527822.0, ans=0.1 2023-06-26 08:15:23,268 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=15.0 2023-06-26 08:15:34,243 INFO [train.py:996] (2/4) Epoch 9, batch 10700, loss[loss=0.1995, simple_loss=0.2785, pruned_loss=0.06027, over 21317.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2796, pruned_loss=0.06367, over 4240219.96 frames. ], batch size: 176, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:15:36,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1527942.0, ans=0.0 2023-06-26 08:16:01,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1528002.0, ans=0.125 2023-06-26 08:17:20,282 INFO [train.py:996] (2/4) Epoch 9, batch 10750, loss[loss=0.292, simple_loss=0.3756, pruned_loss=0.1042, over 21722.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2905, pruned_loss=0.06742, over 4247119.50 frames. ], batch size: 441, lr: 3.31e-03, grad_scale: 8.0 2023-06-26 08:17:28,918 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.66 vs. 
limit=22.5 2023-06-26 08:18:33,239 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.381e+02 4.303e+02 6.075e+02 7.797e+02 1.997e+03, threshold=1.215e+03, percent-clipped=3.0 2023-06-26 08:18:51,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1528482.0, ans=0.125 2023-06-26 08:19:10,508 INFO [train.py:996] (2/4) Epoch 9, batch 10800, loss[loss=0.2475, simple_loss=0.3215, pruned_loss=0.0867, over 21493.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2955, pruned_loss=0.0685, over 4251256.80 frames. ], batch size: 211, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:19:13,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1528542.0, ans=0.125 2023-06-26 08:20:48,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1528782.0, ans=0.04949747468305833 2023-06-26 08:20:50,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1528782.0, ans=0.0 2023-06-26 08:20:57,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1528782.0, ans=0.04949747468305833 2023-06-26 08:21:00,973 INFO [train.py:996] (2/4) Epoch 9, batch 10850, loss[loss=0.1962, simple_loss=0.267, pruned_loss=0.06269, over 21199.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2954, pruned_loss=0.06865, over 4256005.06 frames. ], batch size: 159, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:21:03,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1528842.0, ans=0.04949747468305833 2023-06-26 08:22:14,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1529022.0, ans=0.2 2023-06-26 08:22:19,322 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.316e+02 4.810e+02 7.791e+02 1.214e+03 2.371e+03, threshold=1.558e+03, percent-clipped=23.0 2023-06-26 08:22:56,757 INFO [train.py:996] (2/4) Epoch 9, batch 10900, loss[loss=0.1973, simple_loss=0.2831, pruned_loss=0.05577, over 21381.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2892, pruned_loss=0.06668, over 4252400.71 frames. ], batch size: 194, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:24:44,118 INFO [train.py:996] (2/4) Epoch 9, batch 10950, loss[loss=0.2255, simple_loss=0.3021, pruned_loss=0.07446, over 20682.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2856, pruned_loss=0.06503, over 4261864.78 frames. 
], batch size: 607, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:24:44,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1529442.0, ans=0.5 2023-06-26 08:24:53,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1529442.0, ans=0.2 2023-06-26 08:25:30,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1529562.0, ans=0.0 2023-06-26 08:25:37,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1529562.0, ans=0.0 2023-06-26 08:25:55,412 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.406e+02 4.859e+02 7.093e+02 1.092e+03 2.550e+03, threshold=1.419e+03, percent-clipped=10.0 2023-06-26 08:26:26,616 INFO [train.py:996] (2/4) Epoch 9, batch 11000, loss[loss=0.2004, simple_loss=0.2758, pruned_loss=0.06255, over 21499.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2847, pruned_loss=0.06535, over 4268559.14 frames. ], batch size: 212, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:26:34,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=1529742.0, ans=0.2 2023-06-26 08:26:53,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1529802.0, ans=0.2 2023-06-26 08:27:22,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1529862.0, ans=0.0 2023-06-26 08:27:23,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1529862.0, ans=0.025 2023-06-26 08:27:31,307 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=12.0 2023-06-26 08:27:59,010 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.16 vs. limit=15.0 2023-06-26 08:28:00,758 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=15.0 2023-06-26 08:28:20,364 INFO [train.py:996] (2/4) Epoch 9, batch 11050, loss[loss=0.1982, simple_loss=0.256, pruned_loss=0.07024, over 21576.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2823, pruned_loss=0.06717, over 4273559.04 frames. ], batch size: 414, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:29:32,133 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.162e+02 4.865e+02 7.286e+02 1.085e+03 1.953e+03, threshold=1.457e+03, percent-clipped=8.0 2023-06-26 08:29:36,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1530222.0, ans=0.0 2023-06-26 08:30:03,335 INFO [train.py:996] (2/4) Epoch 9, batch 11100, loss[loss=0.2123, simple_loss=0.2733, pruned_loss=0.07564, over 21551.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2802, pruned_loss=0.06686, over 4266529.42 frames. 
], batch size: 441, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:30:16,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1530342.0, ans=0.125 2023-06-26 08:31:03,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1530462.0, ans=0.1 2023-06-26 08:31:35,260 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.57 vs. limit=22.5 2023-06-26 08:31:39,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1530582.0, ans=0.2 2023-06-26 08:31:41,783 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:31:56,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1530642.0, ans=0.125 2023-06-26 08:31:56,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1530642.0, ans=0.125 2023-06-26 08:31:57,809 INFO [train.py:996] (2/4) Epoch 9, batch 11150, loss[loss=0.2062, simple_loss=0.2794, pruned_loss=0.06646, over 21846.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2788, pruned_loss=0.06648, over 4259054.83 frames. ], batch size: 107, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:32:19,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1530702.0, ans=0.125 2023-06-26 08:33:09,611 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.403e+02 4.594e+02 7.408e+02 1.103e+03 2.164e+03, threshold=1.482e+03, percent-clipped=12.0 2023-06-26 08:33:40,345 INFO [train.py:996] (2/4) Epoch 9, batch 11200, loss[loss=0.1815, simple_loss=0.2567, pruned_loss=0.05316, over 21699.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2778, pruned_loss=0.06588, over 4262953.88 frames. ], batch size: 282, lr: 3.31e-03, grad_scale: 32.0 2023-06-26 08:33:47,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1530942.0, ans=0.125 2023-06-26 08:33:55,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1530942.0, ans=0.125 2023-06-26 08:34:06,344 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=15.0 2023-06-26 08:35:30,888 INFO [train.py:996] (2/4) Epoch 9, batch 11250, loss[loss=0.2447, simple_loss=0.3164, pruned_loss=0.08647, over 21907.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2775, pruned_loss=0.06622, over 4263272.78 frames. ], batch size: 107, lr: 3.31e-03, grad_scale: 32.0 2023-06-26 08:36:30,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1531362.0, ans=0.125 2023-06-26 08:36:50,832 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.393e+02 4.914e+02 6.866e+02 9.264e+02 1.730e+03, threshold=1.373e+03, percent-clipped=7.0 2023-06-26 08:37:20,669 INFO [train.py:996] (2/4) Epoch 9, batch 11300, loss[loss=0.2225, simple_loss=0.2957, pruned_loss=0.07461, over 21759.00 frames. 
], tot_loss[loss=0.2072, simple_loss=0.2801, pruned_loss=0.06715, over 4275425.44 frames. ], batch size: 389, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:37:52,413 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.40 vs. limit=8.0 2023-06-26 08:38:05,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1531602.0, ans=0.125 2023-06-26 08:38:42,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1531722.0, ans=0.1 2023-06-26 08:38:55,973 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=15.0 2023-06-26 08:39:16,350 INFO [train.py:996] (2/4) Epoch 9, batch 11350, loss[loss=0.2515, simple_loss=0.3298, pruned_loss=0.08657, over 21845.00 frames. ], tot_loss[loss=0.208, simple_loss=0.283, pruned_loss=0.0665, over 4279165.31 frames. ], batch size: 118, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:39:38,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1531902.0, ans=0.125 2023-06-26 08:40:02,203 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=15.0 2023-06-26 08:40:31,495 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.547e+02 4.947e+02 6.813e+02 1.038e+03 3.040e+03, threshold=1.363e+03, percent-clipped=13.0 2023-06-26 08:41:03,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1532082.0, ans=0.125 2023-06-26 08:41:05,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1532082.0, ans=0.1 2023-06-26 08:41:08,348 INFO [train.py:996] (2/4) Epoch 9, batch 11400, loss[loss=0.2221, simple_loss=0.2993, pruned_loss=0.07243, over 20659.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2892, pruned_loss=0.0692, over 4271769.03 frames. ], batch size: 607, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:42:31,932 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.11 vs. limit=15.0 2023-06-26 08:42:38,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1532322.0, ans=0.125 2023-06-26 08:42:39,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1532382.0, ans=0.125 2023-06-26 08:43:04,968 INFO [train.py:996] (2/4) Epoch 9, batch 11450, loss[loss=0.2205, simple_loss=0.2967, pruned_loss=0.0722, over 21443.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.289, pruned_loss=0.06771, over 4264941.22 frames. 
], batch size: 131, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:43:30,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1532502.0, ans=0.125 2023-06-26 08:44:14,730 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.477e+02 5.112e+02 7.054e+02 1.112e+03 2.275e+03, threshold=1.411e+03, percent-clipped=15.0 2023-06-26 08:44:15,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1532622.0, ans=0.025 2023-06-26 08:44:33,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1532682.0, ans=0.125 2023-06-26 08:44:33,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1532682.0, ans=0.2 2023-06-26 08:45:01,358 INFO [train.py:996] (2/4) Epoch 9, batch 11500, loss[loss=0.1926, simple_loss=0.2899, pruned_loss=0.04766, over 21737.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2923, pruned_loss=0.06849, over 4270032.88 frames. ], batch size: 298, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:45:07,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1532742.0, ans=0.2 2023-06-26 08:45:26,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1532802.0, ans=0.1 2023-06-26 08:45:35,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1532802.0, ans=0.125 2023-06-26 08:46:05,214 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.09 vs. limit=15.0 2023-06-26 08:46:15,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1532922.0, ans=0.0 2023-06-26 08:46:45,047 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.79 vs. limit=10.0 2023-06-26 08:46:53,161 INFO [train.py:996] (2/4) Epoch 9, batch 11550, loss[loss=0.1653, simple_loss=0.2283, pruned_loss=0.05117, over 16607.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2991, pruned_loss=0.06901, over 4263233.68 frames. ], batch size: 61, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:47:21,576 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=15.0 2023-06-26 08:47:37,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1533162.0, ans=0.025 2023-06-26 08:48:07,798 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.13 vs. limit=15.0 2023-06-26 08:48:08,425 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.509e+02 5.865e+02 8.299e+02 1.163e+03 3.420e+03, threshold=1.660e+03, percent-clipped=18.0 2023-06-26 08:48:48,953 INFO [train.py:996] (2/4) Epoch 9, batch 11600, loss[loss=0.2385, simple_loss=0.3332, pruned_loss=0.07192, over 21393.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3181, pruned_loss=0.07271, over 4269895.53 frames. 
], batch size: 194, lr: 3.31e-03, grad_scale: 32.0 2023-06-26 08:49:14,142 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.07 vs. limit=10.0 2023-06-26 08:49:28,114 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=22.5 2023-06-26 08:49:59,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1533522.0, ans=0.0 2023-06-26 08:50:28,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1533582.0, ans=0.125 2023-06-26 08:50:36,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1533642.0, ans=0.0 2023-06-26 08:50:37,993 INFO [train.py:996] (2/4) Epoch 9, batch 11650, loss[loss=0.2479, simple_loss=0.3354, pruned_loss=0.08015, over 21760.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3222, pruned_loss=0.07288, over 4275556.02 frames. ], batch size: 351, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:50:43,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1533642.0, ans=0.125 2023-06-26 08:51:28,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1533762.0, ans=0.0 2023-06-26 08:51:51,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1533822.0, ans=0.125 2023-06-26 08:51:52,916 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.214e+02 7.495e+02 1.149e+03 1.864e+03 4.386e+03, threshold=2.298e+03, percent-clipped=28.0 2023-06-26 08:51:56,038 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.03 vs. limit=15.0 2023-06-26 08:52:14,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1533882.0, ans=0.05 2023-06-26 08:52:26,017 INFO [train.py:996] (2/4) Epoch 9, batch 11700, loss[loss=0.1955, simple_loss=0.2608, pruned_loss=0.06517, over 21383.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3115, pruned_loss=0.07195, over 4265034.51 frames. ], batch size: 389, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:53:11,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1534062.0, ans=0.1 2023-06-26 08:54:13,611 INFO [train.py:996] (2/4) Epoch 9, batch 11750, loss[loss=0.2061, simple_loss=0.2633, pruned_loss=0.07446, over 21392.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3035, pruned_loss=0.07169, over 4264747.92 frames. ], batch size: 144, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:54:25,429 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.12 vs. limit=22.5 2023-06-26 08:54:43,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1534302.0, ans=0.1 2023-06-26 08:54:45,881 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. 
limit=6.0 2023-06-26 08:55:01,741 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-26 08:55:31,059 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.373e+02 4.368e+02 6.221e+02 1.023e+03 2.709e+03, threshold=1.244e+03, percent-clipped=2.0 2023-06-26 08:56:03,976 INFO [train.py:996] (2/4) Epoch 9, batch 11800, loss[loss=0.229, simple_loss=0.3277, pruned_loss=0.06514, over 19867.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3048, pruned_loss=0.07297, over 4265311.71 frames. ], batch size: 704, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:56:15,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1534542.0, ans=0.125 2023-06-26 08:56:32,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1534602.0, ans=0.125 2023-06-26 08:57:14,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1534722.0, ans=0.125 2023-06-26 08:57:53,781 INFO [train.py:996] (2/4) Epoch 9, batch 11850, loss[loss=0.2172, simple_loss=0.3104, pruned_loss=0.06199, over 21888.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3053, pruned_loss=0.0715, over 4272485.47 frames. ], batch size: 316, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:57:58,550 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=22.5 2023-06-26 08:58:25,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1534902.0, ans=0.125 2023-06-26 08:58:58,368 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.31 vs. limit=15.0 2023-06-26 08:58:59,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1534962.0, ans=0.125 2023-06-26 08:59:16,064 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.272e+02 4.313e+02 5.764e+02 8.343e+02 1.784e+03, threshold=1.153e+03, percent-clipped=5.0 2023-06-26 08:59:38,501 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.09 vs. limit=10.0 2023-06-26 08:59:50,218 INFO [train.py:996] (2/4) Epoch 9, batch 11900, loss[loss=0.2241, simple_loss=0.2981, pruned_loss=0.07506, over 21811.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.305, pruned_loss=0.06936, over 4271967.04 frames. ], batch size: 102, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 09:00:33,542 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-26 09:01:22,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1535382.0, ans=0.0 2023-06-26 09:01:36,244 INFO [train.py:996] (2/4) Epoch 9, batch 11950, loss[loss=0.2037, simple_loss=0.2985, pruned_loss=0.05444, over 21653.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.3044, pruned_loss=0.06672, over 4269021.83 frames. 
], batch size: 389, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 09:02:16,239 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 09:02:50,739 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.248e+02 4.636e+02 6.640e+02 1.069e+03 2.597e+03, threshold=1.328e+03, percent-clipped=19.0 2023-06-26 09:03:23,562 INFO [train.py:996] (2/4) Epoch 9, batch 12000, loss[loss=0.1753, simple_loss=0.2477, pruned_loss=0.05143, over 21609.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2987, pruned_loss=0.06475, over 4273421.81 frames. ], batch size: 247, lr: 3.31e-03, grad_scale: 32.0 2023-06-26 09:03:23,563 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-26 09:03:38,805 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([5.0467, 2.4564, 4.3523, 2.5282], device='cuda:2') 2023-06-26 09:03:41,744 INFO [train.py:1028] (2/4) Epoch 9, validation: loss=0.2638, simple_loss=0.3517, pruned_loss=0.08798, over 1796401.00 frames. 2023-06-26 09:03:41,745 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-26 09:04:11,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1535802.0, ans=0.0 2023-06-26 09:04:29,274 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.22 vs. limit=22.5 2023-06-26 09:04:33,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1535862.0, ans=0.125 2023-06-26 09:05:31,710 INFO [train.py:996] (2/4) Epoch 9, batch 12050, loss[loss=0.2268, simple_loss=0.2935, pruned_loss=0.08005, over 21337.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.295, pruned_loss=0.06632, over 4275330.37 frames. ], batch size: 143, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 09:05:32,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1536042.0, ans=0.1 2023-06-26 09:05:46,911 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.92 vs. limit=15.0 2023-06-26 09:06:22,461 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.93 vs. limit=5.0 2023-06-26 09:06:27,517 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.17 vs. limit=15.0 2023-06-26 09:06:54,782 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.278e+02 4.979e+02 7.743e+02 1.300e+03 2.733e+03, threshold=1.549e+03, percent-clipped=23.0 2023-06-26 09:07:02,891 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 09:07:34,246 INFO [train.py:996] (2/4) Epoch 9, batch 12100, loss[loss=0.2732, simple_loss=0.4053, pruned_loss=0.0706, over 20803.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3007, pruned_loss=0.07014, over 4283663.45 frames. ], batch size: 607, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 09:07:47,014 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.29 vs. 
limit=10.0 2023-06-26 09:08:30,753 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.81 vs. limit=15.0 2023-06-26 09:08:31,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1536462.0, ans=0.125 2023-06-26 09:08:56,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1536522.0, ans=0.04949747468305833 2023-06-26 09:09:06,294 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=15.0 2023-06-26 09:09:15,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1536582.0, ans=0.125 2023-06-26 09:09:24,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1536582.0, ans=0.1 2023-06-26 09:09:27,815 INFO [train.py:996] (2/4) Epoch 9, batch 12150, loss[loss=0.1844, simple_loss=0.2304, pruned_loss=0.06913, over 20711.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3048, pruned_loss=0.0694, over 4273157.67 frames. ], batch size: 609, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 09:10:00,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1536702.0, ans=0.125 2023-06-26 09:10:43,934 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.377e+02 5.272e+02 8.352e+02 1.536e+03 2.585e+03, threshold=1.670e+03, percent-clipped=24.0 2023-06-26 09:11:19,132 INFO [train.py:996] (2/4) Epoch 9, batch 12200, loss[loss=0.1772, simple_loss=0.2394, pruned_loss=0.05749, over 21330.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2989, pruned_loss=0.06827, over 4271113.78 frames. ], batch size: 160, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:11:51,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1537002.0, ans=0.125 2023-06-26 09:12:03,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1537062.0, ans=0.0 2023-06-26 09:12:10,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1537062.0, ans=0.2 2023-06-26 09:12:46,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1537182.0, ans=0.0 2023-06-26 09:13:06,918 INFO [train.py:996] (2/4) Epoch 9, batch 12250, loss[loss=0.1682, simple_loss=0.254, pruned_loss=0.0412, over 21648.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2909, pruned_loss=0.06571, over 4265146.84 frames. ], batch size: 414, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:13:46,171 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.13 vs. limit=10.0 2023-06-26 09:13:54,946 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.85 vs. 
limit=6.0 2023-06-26 09:14:12,897 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.985e+02 4.210e+02 5.762e+02 8.754e+02 2.023e+03, threshold=1.152e+03, percent-clipped=2.0 2023-06-26 09:14:13,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1537422.0, ans=0.125 2023-06-26 09:14:20,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1537422.0, ans=10.0 2023-06-26 09:14:38,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1537482.0, ans=0.0 2023-06-26 09:14:55,452 INFO [train.py:996] (2/4) Epoch 9, batch 12300, loss[loss=0.1651, simple_loss=0.2494, pruned_loss=0.04039, over 21305.00 frames. ], tot_loss[loss=0.202, simple_loss=0.283, pruned_loss=0.06052, over 4253210.81 frames. ], batch size: 176, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:15:10,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1537542.0, ans=0.125 2023-06-26 09:15:14,342 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-06-26 09:15:40,511 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.69 vs. limit=15.0 2023-06-26 09:15:41,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1537662.0, ans=0.2 2023-06-26 09:16:23,864 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.24 vs. limit=15.0 2023-06-26 09:16:24,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1537782.0, ans=0.0 2023-06-26 09:16:42,690 INFO [train.py:996] (2/4) Epoch 9, batch 12350, loss[loss=0.215, simple_loss=0.2963, pruned_loss=0.06688, over 21653.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2881, pruned_loss=0.06129, over 4252810.89 frames. ], batch size: 230, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:17:47,950 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.420e+02 5.624e+02 9.354e+02 1.463e+03 3.322e+03, threshold=1.871e+03, percent-clipped=32.0 2023-06-26 09:18:29,217 INFO [train.py:996] (2/4) Epoch 9, batch 12400, loss[loss=0.2107, simple_loss=0.2769, pruned_loss=0.07226, over 21547.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.29, pruned_loss=0.06446, over 4264254.49 frames. 
], batch size: 194, lr: 3.30e-03, grad_scale: 32.0 2023-06-26 09:18:31,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1538142.0, ans=0.125 2023-06-26 09:19:18,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1538262.0, ans=0.125 2023-06-26 09:19:43,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1538322.0, ans=0.05 2023-06-26 09:20:01,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1538382.0, ans=0.0 2023-06-26 09:20:12,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1538382.0, ans=0.125 2023-06-26 09:20:18,879 INFO [train.py:996] (2/4) Epoch 9, batch 12450, loss[loss=0.2459, simple_loss=0.3168, pruned_loss=0.08755, over 21394.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2943, pruned_loss=0.06797, over 4273803.11 frames. ], batch size: 176, lr: 3.30e-03, grad_scale: 32.0 2023-06-26 09:21:12,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1538562.0, ans=0.125 2023-06-26 09:21:32,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1538622.0, ans=0.1 2023-06-26 09:21:39,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1538622.0, ans=0.125 2023-06-26 09:21:43,225 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.605e+02 6.014e+02 7.920e+02 1.251e+03 2.737e+03, threshold=1.584e+03, percent-clipped=3.0 2023-06-26 09:22:15,993 INFO [train.py:996] (2/4) Epoch 9, batch 12500, loss[loss=0.234, simple_loss=0.3295, pruned_loss=0.06928, over 21603.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3031, pruned_loss=0.07078, over 4266310.70 frames. ], batch size: 230, lr: 3.30e-03, grad_scale: 32.0 2023-06-26 09:22:56,396 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=15.0 2023-06-26 09:23:28,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1538922.0, ans=0.2 2023-06-26 09:24:02,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1538982.0, ans=0.05 2023-06-26 09:24:02,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1538982.0, ans=0.04949747468305833 2023-06-26 09:24:07,324 INFO [train.py:996] (2/4) Epoch 9, batch 12550, loss[loss=0.2323, simple_loss=0.3089, pruned_loss=0.07785, over 21678.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3088, pruned_loss=0.07237, over 4265826.72 frames. 
], batch size: 351, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:25:32,849 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.614e+02 5.506e+02 7.478e+02 1.164e+03 2.448e+03, threshold=1.496e+03, percent-clipped=9.0 2023-06-26 09:25:49,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1539282.0, ans=0.0 2023-06-26 09:26:02,768 INFO [train.py:996] (2/4) Epoch 9, batch 12600, loss[loss=0.1762, simple_loss=0.2621, pruned_loss=0.04513, over 21380.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3064, pruned_loss=0.07071, over 4258108.23 frames. ], batch size: 131, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:26:04,327 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-06-26 09:26:17,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1539342.0, ans=0.125 2023-06-26 09:26:26,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1539402.0, ans=0.2 2023-06-26 09:27:03,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1539462.0, ans=0.125 2023-06-26 09:27:50,963 INFO [train.py:996] (2/4) Epoch 9, batch 12650, loss[loss=0.2062, simple_loss=0.2736, pruned_loss=0.06943, over 21494.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2986, pruned_loss=0.06709, over 4260698.73 frames. ], batch size: 194, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:27:53,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1539642.0, ans=0.0 2023-06-26 09:29:03,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1539822.0, ans=15.0 2023-06-26 09:29:09,318 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.312e+02 4.812e+02 9.064e+02 1.405e+03 2.946e+03, threshold=1.813e+03, percent-clipped=21.0 2023-06-26 09:29:44,744 INFO [train.py:996] (2/4) Epoch 9, batch 12700, loss[loss=0.3024, simple_loss=0.362, pruned_loss=0.1214, over 21354.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2996, pruned_loss=0.06968, over 4265074.25 frames. ], batch size: 507, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:29:54,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1539942.0, ans=0.2 2023-06-26 09:30:09,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1540002.0, ans=0.0 2023-06-26 09:30:35,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1540062.0, ans=0.05 2023-06-26 09:31:02,706 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.78 vs. limit=15.0 2023-06-26 09:31:32,370 INFO [train.py:996] (2/4) Epoch 9, batch 12750, loss[loss=0.2064, simple_loss=0.2904, pruned_loss=0.06119, over 21424.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.3002, pruned_loss=0.06998, over 4271147.25 frames. 
], batch size: 131, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:32:35,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1540422.0, ans=0.0 2023-06-26 09:32:38,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1540422.0, ans=0.125 2023-06-26 09:32:44,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1540422.0, ans=0.0 2023-06-26 09:32:45,550 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.578e+02 5.161e+02 7.205e+02 9.772e+02 1.736e+03, threshold=1.441e+03, percent-clipped=0.0 2023-06-26 09:32:51,804 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.59 vs. limit=15.0 2023-06-26 09:33:19,529 INFO [train.py:996] (2/4) Epoch 9, batch 12800, loss[loss=0.2405, simple_loss=0.3131, pruned_loss=0.08394, over 21763.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.3002, pruned_loss=0.07073, over 4273282.87 frames. ], batch size: 414, lr: 3.30e-03, grad_scale: 32.0 2023-06-26 09:33:26,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1540542.0, ans=0.125 2023-06-26 09:34:16,332 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 09:35:05,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1540782.0, ans=0.5 2023-06-26 09:35:13,874 INFO [train.py:996] (2/4) Epoch 9, batch 12850, loss[loss=0.1937, simple_loss=0.279, pruned_loss=0.05418, over 21286.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3028, pruned_loss=0.07199, over 4272901.12 frames. ], batch size: 176, lr: 3.30e-03, grad_scale: 32.0 2023-06-26 09:35:27,601 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.05 vs. limit=10.0 2023-06-26 09:35:29,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1540842.0, ans=0.0 2023-06-26 09:35:36,780 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.65 vs. limit=10.0 2023-06-26 09:36:32,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1541022.0, ans=0.125 2023-06-26 09:36:36,355 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.020e+02 4.565e+02 5.945e+02 7.206e+02 1.665e+03, threshold=1.189e+03, percent-clipped=1.0 2023-06-26 09:37:01,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1541082.0, ans=0.07 2023-06-26 09:37:04,598 INFO [train.py:996] (2/4) Epoch 9, batch 12900, loss[loss=0.1753, simple_loss=0.2526, pruned_loss=0.04899, over 21320.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.3007, pruned_loss=0.06902, over 4271200.23 frames. 
], batch size: 176, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:38:13,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1541322.0, ans=0.125 2023-06-26 09:38:53,025 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.11 vs. limit=6.0 2023-06-26 09:38:55,137 INFO [train.py:996] (2/4) Epoch 9, batch 12950, loss[loss=0.2455, simple_loss=0.3587, pruned_loss=0.06619, over 19767.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.299, pruned_loss=0.06714, over 4271173.44 frames. ], batch size: 703, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:38:59,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1541442.0, ans=0.0 2023-06-26 09:39:33,482 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=22.5 2023-06-26 09:40:21,284 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.550e+02 5.457e+02 7.611e+02 1.240e+03 2.264e+03, threshold=1.522e+03, percent-clipped=25.0 2023-06-26 09:40:42,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1541742.0, ans=0.125 2023-06-26 09:40:43,298 INFO [train.py:996] (2/4) Epoch 9, batch 13000, loss[loss=0.1783, simple_loss=0.2577, pruned_loss=0.04945, over 21716.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2996, pruned_loss=0.06805, over 4274044.48 frames. ], batch size: 124, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:40:52,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1541742.0, ans=0.0 2023-06-26 09:41:43,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1541862.0, ans=0.0 2023-06-26 09:41:50,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1541922.0, ans=0.125 2023-06-26 09:42:31,836 INFO [train.py:996] (2/4) Epoch 9, batch 13050, loss[loss=0.2151, simple_loss=0.2849, pruned_loss=0.07259, over 21584.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2954, pruned_loss=0.06641, over 4278667.78 frames. ], batch size: 195, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:42:37,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1542042.0, ans=0.125 2023-06-26 09:43:46,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1542222.0, ans=0.1 2023-06-26 09:43:58,309 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.428e+02 4.464e+02 7.205e+02 1.000e+03 2.248e+03, threshold=1.441e+03, percent-clipped=5.0 2023-06-26 09:43:59,452 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-26 09:44:05,015 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.98 vs. 
limit=6.0 2023-06-26 09:44:09,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1542282.0, ans=0.125 2023-06-26 09:44:09,625 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 09:44:21,923 INFO [train.py:996] (2/4) Epoch 9, batch 13100, loss[loss=0.259, simple_loss=0.3331, pruned_loss=0.09247, over 21246.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2964, pruned_loss=0.06669, over 4286231.76 frames. ], batch size: 143, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:44:28,450 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=22.5 2023-06-26 09:44:28,537 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0 2023-06-26 09:44:39,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1542342.0, ans=0.2 2023-06-26 09:44:41,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1542342.0, ans=0.125 2023-06-26 09:45:45,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1542522.0, ans=0.0 2023-06-26 09:46:06,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1542582.0, ans=0.2 2023-06-26 09:46:20,410 INFO [train.py:996] (2/4) Epoch 9, batch 13150, loss[loss=0.1709, simple_loss=0.2499, pruned_loss=0.04596, over 21376.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2992, pruned_loss=0.06913, over 4286287.16 frames. ], batch size: 211, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:46:42,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1542642.0, ans=0.1 2023-06-26 09:47:40,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1542822.0, ans=0.0 2023-06-26 09:47:43,985 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.265e+02 6.125e+02 9.524e+02 1.520e+03 3.301e+03, threshold=1.905e+03, percent-clipped=27.0 2023-06-26 09:48:24,348 INFO [train.py:996] (2/4) Epoch 9, batch 13200, loss[loss=0.2376, simple_loss=0.3086, pruned_loss=0.0833, over 21508.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2968, pruned_loss=0.06905, over 4279042.63 frames. ], batch size: 194, lr: 3.30e-03, grad_scale: 32.0 2023-06-26 09:48:25,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1542942.0, ans=0.05 2023-06-26 09:48:54,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1543002.0, ans=0.125 2023-06-26 09:49:29,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1543122.0, ans=0.2 2023-06-26 09:49:37,438 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.71 vs. 
limit=15.0 2023-06-26 09:49:45,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1543182.0, ans=0.125 2023-06-26 09:49:54,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1543182.0, ans=0.0 2023-06-26 09:50:16,147 INFO [train.py:996] (2/4) Epoch 9, batch 13250, loss[loss=0.2028, simple_loss=0.2769, pruned_loss=0.06437, over 21294.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2967, pruned_loss=0.07062, over 4290473.45 frames. ], batch size: 176, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:51:13,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1543422.0, ans=0.125 2023-06-26 09:51:48,038 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.553e+02 4.712e+02 6.598e+02 9.234e+02 1.581e+03, threshold=1.320e+03, percent-clipped=0.0 2023-06-26 09:52:02,355 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.27 vs. limit=15.0 2023-06-26 09:52:06,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1543542.0, ans=0.125 2023-06-26 09:52:13,068 INFO [train.py:996] (2/4) Epoch 9, batch 13300, loss[loss=0.2156, simple_loss=0.3291, pruned_loss=0.05106, over 21255.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2995, pruned_loss=0.07019, over 4287304.25 frames. ], batch size: 548, lr: 3.30e-03, grad_scale: 8.0 2023-06-26 09:53:38,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn2.whiten.whitening_limit, batch_count=1543722.0, ans=22.5 2023-06-26 09:53:39,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1543782.0, ans=0.1 2023-06-26 09:54:02,886 INFO [train.py:996] (2/4) Epoch 9, batch 13350, loss[loss=0.2433, simple_loss=0.3232, pruned_loss=0.08172, over 21807.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3051, pruned_loss=0.07276, over 4286896.56 frames. ], batch size: 118, lr: 3.30e-03, grad_scale: 8.0 2023-06-26 09:55:23,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1544022.0, ans=0.1 2023-06-26 09:55:24,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1544022.0, ans=0.125 2023-06-26 09:55:27,616 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.028e+02 5.402e+02 7.933e+02 1.042e+03 2.169e+03, threshold=1.587e+03, percent-clipped=13.0 2023-06-26 09:55:51,781 INFO [train.py:996] (2/4) Epoch 9, batch 13400, loss[loss=0.2695, simple_loss=0.3353, pruned_loss=0.1018, over 21535.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3054, pruned_loss=0.07426, over 4282685.80 frames. ], batch size: 471, lr: 3.30e-03, grad_scale: 8.0 2023-06-26 09:56:45,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1544262.0, ans=0.0 2023-06-26 09:56:45,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1544262.0, ans=0.125 2023-06-26 09:56:47,929 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.99 vs. 
limit=15.0 2023-06-26 09:57:02,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1544322.0, ans=0.0 2023-06-26 09:57:16,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1544322.0, ans=0.125 2023-06-26 09:57:19,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1544382.0, ans=0.125 2023-06-26 09:57:34,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1544382.0, ans=0.0 2023-06-26 09:57:36,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1544382.0, ans=0.125 2023-06-26 09:57:39,190 INFO [train.py:996] (2/4) Epoch 9, batch 13450, loss[loss=0.2209, simple_loss=0.2958, pruned_loss=0.073, over 21637.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3053, pruned_loss=0.07618, over 4277535.33 frames. ], batch size: 415, lr: 3.30e-03, grad_scale: 8.0 2023-06-26 09:58:25,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1544502.0, ans=0.0 2023-06-26 09:59:01,572 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.85 vs. limit=15.0 2023-06-26 09:59:10,546 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.398e+02 5.052e+02 6.156e+02 8.765e+02 1.835e+03, threshold=1.231e+03, percent-clipped=4.0 2023-06-26 09:59:30,352 INFO [train.py:996] (2/4) Epoch 9, batch 13500, loss[loss=0.213, simple_loss=0.2878, pruned_loss=0.06911, over 21732.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2997, pruned_loss=0.07443, over 4270490.63 frames. ], batch size: 298, lr: 3.30e-03, grad_scale: 8.0 2023-06-26 10:01:27,148 INFO [train.py:996] (2/4) Epoch 9, batch 13550, loss[loss=0.1949, simple_loss=0.286, pruned_loss=0.05194, over 20968.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3028, pruned_loss=0.0734, over 4264989.10 frames. ], batch size: 607, lr: 3.30e-03, grad_scale: 8.0 2023-06-26 10:01:33,916 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.36 vs. limit=10.0 2023-06-26 10:01:35,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1545042.0, ans=0.1 2023-06-26 10:01:54,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1545102.0, ans=0.035 2023-06-26 10:02:15,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1545162.0, ans=0.125 2023-06-26 10:02:33,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1545162.0, ans=0.125 2023-06-26 10:02:45,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1545222.0, ans=0.125 2023-06-26 10:02:51,993 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.799e+02 5.933e+02 9.358e+02 1.476e+03 2.986e+03, threshold=1.872e+03, percent-clipped=34.0 2023-06-26 10:03:16,851 INFO [train.py:996] (2/4) Epoch 9, batch 13600, loss[loss=0.2072, simple_loss=0.2944, pruned_loss=0.05996, over 21756.00 frames. 
], tot_loss[loss=0.2247, simple_loss=0.3038, pruned_loss=0.07283, over 4270994.13 frames. ], batch size: 112, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 10:03:59,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1545402.0, ans=0.07 2023-06-26 10:04:01,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1545462.0, ans=0.0 2023-06-26 10:04:18,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1545462.0, ans=0.125 2023-06-26 10:04:23,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1545522.0, ans=0.2 2023-06-26 10:04:33,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1545522.0, ans=0.125 2023-06-26 10:04:59,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1545582.0, ans=0.1 2023-06-26 10:05:01,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1545582.0, ans=0.125 2023-06-26 10:05:04,171 INFO [train.py:996] (2/4) Epoch 9, batch 13650, loss[loss=0.198, simple_loss=0.2679, pruned_loss=0.06402, over 16221.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2984, pruned_loss=0.07046, over 4267295.47 frames. ], batch size: 66, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 10:05:28,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1545702.0, ans=0.0 2023-06-26 10:06:01,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1545762.0, ans=0.0 2023-06-26 10:06:15,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1545822.0, ans=0.125 2023-06-26 10:06:17,993 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.60 vs. limit=22.5 2023-06-26 10:06:22,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1545822.0, ans=0.125 2023-06-26 10:06:23,707 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.401e+02 4.955e+02 6.723e+02 8.963e+02 2.035e+03, threshold=1.345e+03, percent-clipped=1.0 2023-06-26 10:06:48,920 INFO [train.py:996] (2/4) Epoch 9, batch 13700, loss[loss=0.1626, simple_loss=0.2331, pruned_loss=0.04603, over 21729.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2939, pruned_loss=0.06947, over 4263410.16 frames. ], batch size: 112, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 10:07:00,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1545942.0, ans=0.2 2023-06-26 10:07:36,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1546062.0, ans=0.125 2023-06-26 10:08:45,494 INFO [train.py:996] (2/4) Epoch 9, batch 13750, loss[loss=0.1721, simple_loss=0.2389, pruned_loss=0.0526, over 21391.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2909, pruned_loss=0.06942, over 4269734.51 frames. 
], batch size: 194, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:09:07,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1546302.0, ans=0.125 2023-06-26 10:09:15,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1546302.0, ans=0.0 2023-06-26 10:09:46,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1546422.0, ans=0.2 2023-06-26 10:10:16,982 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.609e+02 6.154e+02 1.114e+03 1.508e+03 3.073e+03, threshold=2.228e+03, percent-clipped=34.0 2023-06-26 10:10:23,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1546482.0, ans=0.125 2023-06-26 10:10:41,608 INFO [train.py:996] (2/4) Epoch 9, batch 13800, loss[loss=0.2077, simple_loss=0.3111, pruned_loss=0.05215, over 21696.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2906, pruned_loss=0.06727, over 4271424.66 frames. ], batch size: 247, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:10:42,429 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 10:11:18,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1546662.0, ans=0.125 2023-06-26 10:11:23,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1546662.0, ans=0.125 2023-06-26 10:12:27,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1546782.0, ans=0.025 2023-06-26 10:12:32,883 INFO [train.py:996] (2/4) Epoch 9, batch 13850, loss[loss=0.24, simple_loss=0.3265, pruned_loss=0.07677, over 21872.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.298, pruned_loss=0.06855, over 4275442.88 frames. ], batch size: 371, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:12:53,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1546902.0, ans=0.125 2023-06-26 10:13:05,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1546902.0, ans=0.025 2023-06-26 10:13:51,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1547022.0, ans=0.5 2023-06-26 10:13:56,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1547022.0, ans=0.2 2023-06-26 10:13:57,552 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.808e+02 5.512e+02 9.000e+02 1.173e+03 2.021e+03, threshold=1.800e+03, percent-clipped=1.0 2023-06-26 10:14:22,469 INFO [train.py:996] (2/4) Epoch 9, batch 13900, loss[loss=0.1949, simple_loss=0.2964, pruned_loss=0.04668, over 20859.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.3022, pruned_loss=0.0708, over 4281568.42 frames. 
], batch size: 608, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:14:24,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1547142.0, ans=0.125 2023-06-26 10:14:32,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1547142.0, ans=0.0 2023-06-26 10:15:37,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1547322.0, ans=0.125 2023-06-26 10:15:38,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1547322.0, ans=0.125 2023-06-26 10:15:46,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1547322.0, ans=0.125 2023-06-26 10:16:11,143 INFO [train.py:996] (2/4) Epoch 9, batch 13950, loss[loss=0.2276, simple_loss=0.2978, pruned_loss=0.07875, over 21650.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3017, pruned_loss=0.07289, over 4284946.13 frames. ], batch size: 263, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:16:50,641 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=12.0 2023-06-26 10:17:34,995 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.697e+02 5.601e+02 7.890e+02 1.100e+03 2.147e+03, threshold=1.578e+03, percent-clipped=2.0 2023-06-26 10:17:58,867 INFO [train.py:996] (2/4) Epoch 9, batch 14000, loss[loss=0.1989, simple_loss=0.2994, pruned_loss=0.04919, over 21799.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3006, pruned_loss=0.07098, over 4275679.43 frames. ], batch size: 332, lr: 3.29e-03, grad_scale: 32.0 2023-06-26 10:19:46,299 INFO [train.py:996] (2/4) Epoch 9, batch 14050, loss[loss=0.2123, simple_loss=0.2738, pruned_loss=0.07539, over 21292.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2961, pruned_loss=0.06768, over 4281156.19 frames. ], batch size: 471, lr: 3.29e-03, grad_scale: 32.0 2023-06-26 10:19:51,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1548042.0, ans=0.0 2023-06-26 10:20:04,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1548102.0, ans=0.125 2023-06-26 10:20:16,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1548102.0, ans=0.125 2023-06-26 10:20:28,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1548162.0, ans=0.125 2023-06-26 10:20:30,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1548162.0, ans=0.1 2023-06-26 10:21:00,168 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.70 vs. 
limit=15.0 2023-06-26 10:21:06,332 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.027e+02 4.796e+02 7.490e+02 1.046e+03 2.202e+03, threshold=1.498e+03, percent-clipped=4.0 2023-06-26 10:21:06,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1548282.0, ans=0.1 2023-06-26 10:21:13,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1548282.0, ans=0.0 2023-06-26 10:21:16,323 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=15.0 2023-06-26 10:21:23,149 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.03 vs. limit=10.0 2023-06-26 10:21:30,909 INFO [train.py:996] (2/4) Epoch 9, batch 14100, loss[loss=0.2212, simple_loss=0.2931, pruned_loss=0.07465, over 21675.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2902, pruned_loss=0.06711, over 4286652.92 frames. ], batch size: 441, lr: 3.29e-03, grad_scale: 32.0 2023-06-26 10:22:02,088 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=15.0 2023-06-26 10:23:04,135 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.25 vs. limit=10.0 2023-06-26 10:23:18,199 INFO [train.py:996] (2/4) Epoch 9, batch 14150, loss[loss=0.2394, simple_loss=0.3166, pruned_loss=0.08114, over 21497.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2935, pruned_loss=0.06786, over 4282946.77 frames. ], batch size: 160, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:23:22,871 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.75 vs. limit=15.0 2023-06-26 10:23:56,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1548762.0, ans=0.1 2023-06-26 10:24:32,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=1548822.0, ans=0.1 2023-06-26 10:24:42,652 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.188e+02 5.825e+02 9.276e+02 1.325e+03 2.479e+03, threshold=1.855e+03, percent-clipped=15.0 2023-06-26 10:24:59,289 INFO [train.py:996] (2/4) Epoch 9, batch 14200, loss[loss=0.1945, simple_loss=0.2642, pruned_loss=0.06241, over 21770.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2937, pruned_loss=0.06769, over 4280457.52 frames. ], batch size: 316, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:25:00,744 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. 
limit=6.0 2023-06-26 10:25:01,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1548942.0, ans=0.125 2023-06-26 10:25:12,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1548942.0, ans=0.125 2023-06-26 10:26:21,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1549122.0, ans=0.125 2023-06-26 10:26:36,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1549182.0, ans=0.2 2023-06-26 10:26:47,082 INFO [train.py:996] (2/4) Epoch 9, batch 14250, loss[loss=0.1953, simple_loss=0.2793, pruned_loss=0.05561, over 21681.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2882, pruned_loss=0.06742, over 4279159.15 frames. ], batch size: 415, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:27:51,952 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=15.0 2023-06-26 10:28:07,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1549422.0, ans=0.0 2023-06-26 10:28:19,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1549482.0, ans=0.125 2023-06-26 10:28:22,815 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.193e+02 4.868e+02 6.668e+02 9.362e+02 2.470e+03, threshold=1.334e+03, percent-clipped=6.0 2023-06-26 10:28:43,666 INFO [train.py:996] (2/4) Epoch 9, batch 14300, loss[loss=0.3482, simple_loss=0.4358, pruned_loss=0.1303, over 21521.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2903, pruned_loss=0.06811, over 4272087.43 frames. ], batch size: 471, lr: 3.29e-03, grad_scale: 8.0 2023-06-26 10:28:53,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1549542.0, ans=0.125 2023-06-26 10:29:00,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1549602.0, ans=0.0 2023-06-26 10:29:06,625 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=12.0 2023-06-26 10:29:07,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1549602.0, ans=0.125 2023-06-26 10:30:23,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1549782.0, ans=0.0 2023-06-26 10:30:33,267 INFO [train.py:996] (2/4) Epoch 9, batch 14350, loss[loss=0.2174, simple_loss=0.2966, pruned_loss=0.06911, over 21871.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2943, pruned_loss=0.06777, over 4273473.21 frames. ], batch size: 371, lr: 3.29e-03, grad_scale: 8.0 2023-06-26 10:30:41,679 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.32 vs. 
limit=15.0 2023-06-26 10:31:38,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1549962.0, ans=0.0 2023-06-26 10:31:54,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1550022.0, ans=0.125 2023-06-26 10:32:00,443 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.420e+02 5.713e+02 8.636e+02 1.390e+03 3.076e+03, threshold=1.727e+03, percent-clipped=28.0 2023-06-26 10:32:14,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1550142.0, ans=0.0 2023-06-26 10:32:21,222 INFO [train.py:996] (2/4) Epoch 9, batch 14400, loss[loss=0.2097, simple_loss=0.2766, pruned_loss=0.0714, over 21643.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2915, pruned_loss=0.06784, over 4272526.56 frames. ], batch size: 332, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:33:24,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1550262.0, ans=0.035 2023-06-26 10:33:45,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1550382.0, ans=15.0 2023-06-26 10:34:03,164 INFO [train.py:996] (2/4) Epoch 9, batch 14450, loss[loss=0.1751, simple_loss=0.2497, pruned_loss=0.0502, over 21637.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2872, pruned_loss=0.06786, over 4266943.33 frames. ], batch size: 247, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:34:49,263 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-06-26 10:35:36,888 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.305e+02 4.619e+02 5.727e+02 8.380e+02 1.480e+03, threshold=1.145e+03, percent-clipped=0.0 2023-06-26 10:35:56,854 INFO [train.py:996] (2/4) Epoch 9, batch 14500, loss[loss=0.2065, simple_loss=0.293, pruned_loss=0.05999, over 21402.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2843, pruned_loss=0.06706, over 4265853.73 frames. ], batch size: 194, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:36:27,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1550802.0, ans=0.2 2023-06-26 10:36:41,600 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-26 10:37:26,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1550982.0, ans=0.125 2023-06-26 10:37:46,757 INFO [train.py:996] (2/4) Epoch 9, batch 14550, loss[loss=0.2384, simple_loss=0.3206, pruned_loss=0.07809, over 21903.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2887, pruned_loss=0.06841, over 4254529.88 frames. 
], batch size: 316, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:39:00,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1551222.0, ans=0.2 2023-06-26 10:39:20,549 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.787e+02 5.550e+02 7.546e+02 1.212e+03 2.573e+03, threshold=1.509e+03, percent-clipped=29.0 2023-06-26 10:39:35,759 INFO [train.py:996] (2/4) Epoch 9, batch 14600, loss[loss=0.3082, simple_loss=0.3607, pruned_loss=0.1279, over 21334.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2968, pruned_loss=0.07166, over 4262053.77 frames. ], batch size: 507, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:39:45,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1551342.0, ans=0.0 2023-06-26 10:40:10,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1551402.0, ans=0.125 2023-06-26 10:40:39,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1551462.0, ans=0.125 2023-06-26 10:40:51,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1551522.0, ans=10.0 2023-06-26 10:41:24,132 INFO [train.py:996] (2/4) Epoch 9, batch 14650, loss[loss=0.1774, simple_loss=0.2748, pruned_loss=0.03998, over 21616.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3009, pruned_loss=0.07134, over 4258978.25 frames. ], batch size: 389, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:41:28,583 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-26 10:41:38,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1551642.0, ans=0.04949747468305833 2023-06-26 10:41:57,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1551702.0, ans=0.125 2023-06-26 10:42:43,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1551882.0, ans=0.025 2023-06-26 10:42:46,541 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.116e+02 4.456e+02 7.843e+02 1.118e+03 1.924e+03, threshold=1.569e+03, percent-clipped=10.0 2023-06-26 10:43:00,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1551882.0, ans=0.1 2023-06-26 10:43:07,370 INFO [train.py:996] (2/4) Epoch 9, batch 14700, loss[loss=0.2199, simple_loss=0.3245, pruned_loss=0.05764, over 21238.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2939, pruned_loss=0.06638, over 4258443.53 frames. 
], batch size: 548, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:43:22,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1551942.0, ans=0.125 2023-06-26 10:44:08,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1552062.0, ans=0.0 2023-06-26 10:44:31,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1552122.0, ans=0.2 2023-06-26 10:44:43,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1552182.0, ans=0.0 2023-06-26 10:44:52,753 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0 2023-06-26 10:44:58,843 INFO [train.py:996] (2/4) Epoch 9, batch 14750, loss[loss=0.3159, simple_loss=0.3936, pruned_loss=0.1191, over 21298.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2993, pruned_loss=0.06914, over 4263593.51 frames. ], batch size: 548, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:45:16,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1552242.0, ans=0.0 2023-06-26 10:45:53,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1552362.0, ans=0.125 2023-06-26 10:46:34,196 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.559e+02 5.782e+02 7.997e+02 1.225e+03 2.854e+03, threshold=1.599e+03, percent-clipped=14.0 2023-06-26 10:46:46,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1552482.0, ans=0.0 2023-06-26 10:46:55,541 INFO [train.py:996] (2/4) Epoch 9, batch 14800, loss[loss=0.3016, simple_loss=0.3521, pruned_loss=0.1255, over 21383.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3104, pruned_loss=0.07478, over 4262099.99 frames. ], batch size: 507, lr: 3.29e-03, grad_scale: 32.0 2023-06-26 10:47:25,648 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.68 vs. limit=15.0 2023-06-26 10:48:04,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1552722.0, ans=0.07 2023-06-26 10:48:36,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1552782.0, ans=0.125 2023-06-26 10:48:37,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1552782.0, ans=0.125 2023-06-26 10:48:59,093 INFO [train.py:996] (2/4) Epoch 9, batch 14850, loss[loss=0.1943, simple_loss=0.322, pruned_loss=0.03332, over 19853.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3044, pruned_loss=0.0744, over 4264615.53 frames. 
], batch size: 702, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:49:28,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1552902.0, ans=0.125 2023-06-26 10:50:35,325 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.505e+02 5.144e+02 7.174e+02 1.026e+03 2.687e+03, threshold=1.435e+03, percent-clipped=5.0 2023-06-26 10:50:50,336 INFO [train.py:996] (2/4) Epoch 9, batch 14900, loss[loss=0.2218, simple_loss=0.2974, pruned_loss=0.07305, over 21804.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3062, pruned_loss=0.07496, over 4264651.13 frames. ], batch size: 282, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:51:27,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1553202.0, ans=0.125 2023-06-26 10:51:27,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1553202.0, ans=0.0 2023-06-26 10:51:43,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1553262.0, ans=0.0 2023-06-26 10:51:43,979 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-26 10:51:46,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1553262.0, ans=0.5 2023-06-26 10:52:18,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1553322.0, ans=0.125 2023-06-26 10:52:46,133 INFO [train.py:996] (2/4) Epoch 9, batch 14950, loss[loss=0.2929, simple_loss=0.3507, pruned_loss=0.1175, over 21453.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3082, pruned_loss=0.07525, over 4267214.60 frames. ], batch size: 509, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:53:06,356 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.30 vs. limit=22.5 2023-06-26 10:53:26,649 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=15.0 2023-06-26 10:54:13,818 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.13 vs. limit=12.0 2023-06-26 10:54:17,657 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.700e+02 5.284e+02 7.127e+02 1.003e+03 2.591e+03, threshold=1.425e+03, percent-clipped=12.0 2023-06-26 10:54:18,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1553682.0, ans=0.125 2023-06-26 10:54:36,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1553742.0, ans=0.0 2023-06-26 10:54:37,172 INFO [train.py:996] (2/4) Epoch 9, batch 15000, loss[loss=0.2204, simple_loss=0.2942, pruned_loss=0.07329, over 21385.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3106, pruned_loss=0.0765, over 4273383.06 frames. ], batch size: 548, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:54:37,173 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-26 10:54:55,453 INFO [train.py:1028] (2/4) Epoch 9, validation: loss=0.2558, simple_loss=0.3464, pruned_loss=0.08259, over 1796401.00 frames. 
2023-06-26 10:54:55,454 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-26 10:55:44,112 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 10:55:58,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1553862.0, ans=0.0 2023-06-26 10:56:02,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1553862.0, ans=0.05 2023-06-26 10:56:05,783 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 10:56:13,076 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.88 vs. limit=15.0 2023-06-26 10:56:14,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1553922.0, ans=0.125 2023-06-26 10:56:46,885 INFO [train.py:996] (2/4) Epoch 9, batch 15050, loss[loss=0.2037, simple_loss=0.2601, pruned_loss=0.07361, over 21847.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3115, pruned_loss=0.07802, over 4279779.81 frames. ], batch size: 107, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:57:27,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1554102.0, ans=0.1 2023-06-26 10:58:21,871 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.687e+02 6.658e+02 1.222e+03 1.555e+03 2.780e+03, threshold=2.443e+03, percent-clipped=32.0 2023-06-26 10:58:36,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1554282.0, ans=0.2 2023-06-26 10:58:41,249 INFO [train.py:996] (2/4) Epoch 9, batch 15100, loss[loss=0.241, simple_loss=0.3151, pruned_loss=0.08338, over 21342.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3123, pruned_loss=0.07731, over 4270178.02 frames. ], batch size: 548, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:59:21,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1554402.0, ans=0.125 2023-06-26 10:59:36,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1554462.0, ans=0.0 2023-06-26 10:59:47,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1554462.0, ans=0.0 2023-06-26 11:00:04,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1554522.0, ans=0.2 2023-06-26 11:00:09,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1554582.0, ans=0.2 2023-06-26 11:00:29,600 INFO [train.py:996] (2/4) Epoch 9, batch 15150, loss[loss=0.1796, simple_loss=0.2538, pruned_loss=0.05271, over 21631.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.309, pruned_loss=0.07775, over 4272768.71 frames. 
], batch size: 282, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 11:00:47,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1554642.0, ans=0.125 2023-06-26 11:00:50,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1554642.0, ans=0.0 2023-06-26 11:01:10,668 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-26 11:01:31,854 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.85 vs. limit=5.0 2023-06-26 11:02:05,232 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.559e+02 4.649e+02 7.475e+02 1.057e+03 2.217e+03, threshold=1.495e+03, percent-clipped=0.0 2023-06-26 11:02:12,123 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.04 vs. limit=15.0 2023-06-26 11:02:19,221 INFO [train.py:996] (2/4) Epoch 9, batch 15200, loss[loss=0.1656, simple_loss=0.2468, pruned_loss=0.04217, over 21575.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3018, pruned_loss=0.07403, over 4261958.46 frames. ], batch size: 263, lr: 3.29e-03, grad_scale: 32.0 2023-06-26 11:02:28,851 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 11:03:11,665 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=22.5 2023-06-26 11:03:57,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1555182.0, ans=0.125 2023-06-26 11:04:02,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1555182.0, ans=0.0 2023-06-26 11:04:13,004 INFO [train.py:996] (2/4) Epoch 9, batch 15250, loss[loss=0.2159, simple_loss=0.2852, pruned_loss=0.0733, over 21450.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2956, pruned_loss=0.07213, over 4263389.12 frames. ], batch size: 389, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 11:05:03,357 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.66 vs. limit=15.0 2023-06-26 11:05:20,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1555422.0, ans=0.0 2023-06-26 11:05:44,374 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.438e+02 5.756e+02 7.926e+02 1.187e+03 2.967e+03, threshold=1.585e+03, percent-clipped=10.0 2023-06-26 11:06:02,510 INFO [train.py:996] (2/4) Epoch 9, batch 15300, loss[loss=0.2523, simple_loss=0.3414, pruned_loss=0.08159, over 17950.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2984, pruned_loss=0.07443, over 4267937.09 frames. 
], batch size: 61, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 11:07:01,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1555662.0, ans=0.0 2023-06-26 11:07:21,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1555722.0, ans=0.125 2023-06-26 11:07:52,677 INFO [train.py:996] (2/4) Epoch 9, batch 15350, loss[loss=0.2868, simple_loss=0.3606, pruned_loss=0.1066, over 21417.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3025, pruned_loss=0.07604, over 4270640.27 frames. ], batch size: 507, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:08:53,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1556022.0, ans=0.0 2023-06-26 11:09:22,279 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.533e+02 5.256e+02 7.334e+02 1.092e+03 2.120e+03, threshold=1.467e+03, percent-clipped=2.0 2023-06-26 11:09:39,836 INFO [train.py:996] (2/4) Epoch 9, batch 15400, loss[loss=0.2232, simple_loss=0.3093, pruned_loss=0.0686, over 21475.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3035, pruned_loss=0.0748, over 4273694.14 frames. ], batch size: 131, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:09:56,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1556142.0, ans=0.125 2023-06-26 11:10:13,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1556202.0, ans=0.125 2023-06-26 11:10:49,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1556322.0, ans=0.1 2023-06-26 11:11:13,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1556382.0, ans=0.125 2023-06-26 11:11:23,609 INFO [train.py:996] (2/4) Epoch 9, batch 15450, loss[loss=0.208, simple_loss=0.3027, pruned_loss=0.05668, over 21797.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3015, pruned_loss=0.07424, over 4273736.50 frames. ], batch size: 351, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:11:24,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1556442.0, ans=0.2 2023-06-26 11:12:40,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1556622.0, ans=0.125 2023-06-26 11:13:01,735 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.285e+02 4.674e+02 6.020e+02 7.889e+02 1.710e+03, threshold=1.204e+03, percent-clipped=2.0 2023-06-26 11:13:20,040 INFO [train.py:996] (2/4) Epoch 9, batch 15500, loss[loss=0.2751, simple_loss=0.3453, pruned_loss=0.1024, over 21330.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3051, pruned_loss=0.07502, over 4254653.80 frames. 
], batch size: 143, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:14:18,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1556922.0, ans=0.125 2023-06-26 11:15:07,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1556982.0, ans=0.04949747468305833 2023-06-26 11:15:10,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1557042.0, ans=0.125 2023-06-26 11:15:11,424 INFO [train.py:996] (2/4) Epoch 9, batch 15550, loss[loss=0.1975, simple_loss=0.2686, pruned_loss=0.06322, over 21344.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3034, pruned_loss=0.07195, over 4255366.79 frames. ], batch size: 160, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:15:21,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1557042.0, ans=0.0 2023-06-26 11:15:34,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1557102.0, ans=0.125 2023-06-26 11:15:35,394 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.64 vs. limit=15.0 2023-06-26 11:15:52,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1557162.0, ans=0.2 2023-06-26 11:16:11,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1557162.0, ans=0.0 2023-06-26 11:16:11,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1557162.0, ans=0.0 2023-06-26 11:16:15,479 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=12.0 2023-06-26 11:16:41,917 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.382e+02 5.062e+02 7.091e+02 1.054e+03 2.391e+03, threshold=1.418e+03, percent-clipped=18.0 2023-06-26 11:16:59,948 INFO [train.py:996] (2/4) Epoch 9, batch 15600, loss[loss=0.2553, simple_loss=0.3156, pruned_loss=0.09747, over 21384.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2984, pruned_loss=0.0706, over 4248091.76 frames. ], batch size: 508, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:17:06,566 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.04 vs. limit=15.0 2023-06-26 11:17:51,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1557462.0, ans=0.0 2023-06-26 11:18:48,391 INFO [train.py:996] (2/4) Epoch 9, batch 15650, loss[loss=0.2135, simple_loss=0.2761, pruned_loss=0.07547, over 21464.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2968, pruned_loss=0.06989, over 4253712.22 frames. ], batch size: 441, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:18:54,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1557642.0, ans=0.1 2023-06-26 11:19:35,226 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.17 vs. 
limit=12.0 2023-06-26 11:20:25,562 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.281e+02 4.437e+02 5.415e+02 7.572e+02 1.667e+03, threshold=1.083e+03, percent-clipped=3.0 2023-06-26 11:20:43,538 INFO [train.py:996] (2/4) Epoch 9, batch 15700, loss[loss=0.2047, simple_loss=0.2711, pruned_loss=0.06909, over 21857.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2925, pruned_loss=0.06906, over 4248036.51 frames. ], batch size: 107, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:21:02,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1558002.0, ans=0.2 2023-06-26 11:21:51,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1558122.0, ans=0.0 2023-06-26 11:22:30,873 INFO [train.py:996] (2/4) Epoch 9, batch 15750, loss[loss=0.1847, simple_loss=0.2524, pruned_loss=0.05847, over 21957.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2882, pruned_loss=0.06896, over 4258389.07 frames. ], batch size: 119, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:22:36,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1558242.0, ans=0.1 2023-06-26 11:22:41,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1558242.0, ans=0.125 2023-06-26 11:22:54,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1558302.0, ans=0.0 2023-06-26 11:24:01,258 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.153e+02 4.399e+02 6.641e+02 9.028e+02 1.552e+03, threshold=1.328e+03, percent-clipped=11.0 2023-06-26 11:24:18,422 INFO [train.py:996] (2/4) Epoch 9, batch 15800, loss[loss=0.2022, simple_loss=0.2607, pruned_loss=0.07185, over 21589.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2839, pruned_loss=0.06828, over 4265817.74 frames. ], batch size: 442, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:24:53,028 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.77 vs. limit=10.0 2023-06-26 11:25:27,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1558722.0, ans=0.95 2023-06-26 11:26:06,293 INFO [train.py:996] (2/4) Epoch 9, batch 15850, loss[loss=0.2364, simple_loss=0.3087, pruned_loss=0.08207, over 21387.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2855, pruned_loss=0.07026, over 4270456.35 frames. ], batch size: 549, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:26:13,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1558842.0, ans=0.0 2023-06-26 11:26:41,857 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.61 vs. limit=22.5 2023-06-26 11:27:03,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1559022.0, ans=0.125 2023-06-26 11:27:21,440 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.73 vs. 
limit=15.0 2023-06-26 11:27:38,952 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.367e+02 5.055e+02 6.778e+02 9.936e+02 2.216e+03, threshold=1.356e+03, percent-clipped=9.0 2023-06-26 11:27:49,543 INFO [train.py:996] (2/4) Epoch 9, batch 15900, loss[loss=0.2169, simple_loss=0.2902, pruned_loss=0.07183, over 21833.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2822, pruned_loss=0.06988, over 4278539.89 frames. ], batch size: 124, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:27:58,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1559142.0, ans=0.2 2023-06-26 11:28:14,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1559202.0, ans=0.125 2023-06-26 11:29:11,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1559322.0, ans=0.125 2023-06-26 11:29:21,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1559382.0, ans=0.125 2023-06-26 11:29:38,902 INFO [train.py:996] (2/4) Epoch 9, batch 15950, loss[loss=0.2061, simple_loss=0.2818, pruned_loss=0.06516, over 15503.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2833, pruned_loss=0.06803, over 4261330.13 frames. ], batch size: 60, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:30:18,138 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=22.5 2023-06-26 11:30:43,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1559622.0, ans=0.2 2023-06-26 11:31:17,728 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.330e+02 4.974e+02 7.474e+02 9.810e+02 2.700e+03, threshold=1.495e+03, percent-clipped=8.0 2023-06-26 11:31:28,100 INFO [train.py:996] (2/4) Epoch 9, batch 16000, loss[loss=0.1861, simple_loss=0.2586, pruned_loss=0.05682, over 21853.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2843, pruned_loss=0.06574, over 4269165.06 frames. ], batch size: 98, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:32:32,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1559922.0, ans=0.0 2023-06-26 11:32:34,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1559922.0, ans=0.2 2023-06-26 11:33:15,460 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-06-26 11:33:16,789 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-26 11:33:17,742 INFO [train.py:996] (2/4) Epoch 9, batch 16050, loss[loss=0.27, simple_loss=0.3635, pruned_loss=0.08827, over 21656.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2871, pruned_loss=0.06444, over 4262971.78 frames. 
], batch size: 441, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:33:27,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1560042.0, ans=0.125 2023-06-26 11:33:29,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1560042.0, ans=0.125 2023-06-26 11:33:30,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1560042.0, ans=0.125 2023-06-26 11:34:06,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1560162.0, ans=0.125 2023-06-26 11:34:21,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1560222.0, ans=0.125 2023-06-26 11:34:35,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1560282.0, ans=0.125 2023-06-26 11:34:39,558 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 11:34:45,492 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.405e+02 5.738e+02 8.747e+02 1.434e+03 3.009e+03, threshold=1.749e+03, percent-clipped=21.0 2023-06-26 11:34:53,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1560282.0, ans=0.1 2023-06-26 11:35:05,347 INFO [train.py:996] (2/4) Epoch 9, batch 16100, loss[loss=0.1998, simple_loss=0.2863, pruned_loss=0.05663, over 21295.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2908, pruned_loss=0.06483, over 4268251.18 frames. ], batch size: 176, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:35:17,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1560342.0, ans=0.0 2023-06-26 11:35:31,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1560402.0, ans=0.0 2023-06-26 11:35:59,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1560462.0, ans=0.125 2023-06-26 11:36:02,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1560462.0, ans=0.1 2023-06-26 11:36:22,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1560522.0, ans=0.125 2023-06-26 11:36:54,143 INFO [train.py:996] (2/4) Epoch 9, batch 16150, loss[loss=0.2488, simple_loss=0.3214, pruned_loss=0.08812, over 21637.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2917, pruned_loss=0.0671, over 4278360.03 frames. ], batch size: 471, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:37:01,906 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.53 vs. 
limit=15.0 2023-06-26 11:37:04,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1560642.0, ans=0.125 2023-06-26 11:37:10,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1560642.0, ans=0.125 2023-06-26 11:37:50,848 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=22.5 2023-06-26 11:38:33,311 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.369e+02 5.523e+02 8.339e+02 1.289e+03 2.279e+03, threshold=1.668e+03, percent-clipped=10.0 2023-06-26 11:38:46,830 INFO [train.py:996] (2/4) Epoch 9, batch 16200, loss[loss=0.258, simple_loss=0.3329, pruned_loss=0.09153, over 21827.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2943, pruned_loss=0.06857, over 4283805.24 frames. ], batch size: 118, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:39:02,784 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=15.0 2023-06-26 11:39:28,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1561062.0, ans=0.1 2023-06-26 11:40:38,358 INFO [train.py:996] (2/4) Epoch 9, batch 16250, loss[loss=0.2051, simple_loss=0.2806, pruned_loss=0.0648, over 21808.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2958, pruned_loss=0.07046, over 4284296.15 frames. ], batch size: 372, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:40:50,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1561242.0, ans=0.04949747468305833 2023-06-26 11:40:51,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1561242.0, ans=0.125 2023-06-26 11:41:30,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=1561362.0, ans=0.02 2023-06-26 11:41:49,004 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 11:42:17,693 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.175e+02 4.964e+02 6.149e+02 9.832e+02 2.311e+03, threshold=1.230e+03, percent-clipped=3.0 2023-06-26 11:42:26,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1561542.0, ans=0.0 2023-06-26 11:42:26,958 INFO [train.py:996] (2/4) Epoch 9, batch 16300, loss[loss=0.2057, simple_loss=0.2822, pruned_loss=0.06458, over 21745.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2895, pruned_loss=0.06718, over 4264182.55 frames. ], batch size: 371, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:43:31,502 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.38 vs. limit=10.0 2023-06-26 11:44:14,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1561782.0, ans=0.2 2023-06-26 11:44:17,133 INFO [train.py:996] (2/4) Epoch 9, batch 16350, loss[loss=0.1624, simple_loss=0.2357, pruned_loss=0.04457, over 21174.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2898, pruned_loss=0.06733, over 4259019.19 frames. 
], batch size: 176, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:45:56,602 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.631e+02 4.777e+02 5.847e+02 7.634e+02 1.657e+03, threshold=1.169e+03, percent-clipped=4.0 2023-06-26 11:46:05,044 INFO [train.py:996] (2/4) Epoch 9, batch 16400, loss[loss=0.223, simple_loss=0.3011, pruned_loss=0.07248, over 21695.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2939, pruned_loss=0.0693, over 4267587.43 frames. ], batch size: 389, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:47:28,959 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=12.0 2023-06-26 11:47:38,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1562382.0, ans=0.05 2023-06-26 11:47:42,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1562382.0, ans=0.125 2023-06-26 11:47:43,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1562382.0, ans=0.04949747468305833 2023-06-26 11:47:52,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1562442.0, ans=0.125 2023-06-26 11:47:54,201 INFO [train.py:996] (2/4) Epoch 9, batch 16450, loss[loss=0.2152, simple_loss=0.2803, pruned_loss=0.07503, over 21565.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2936, pruned_loss=0.06966, over 4278199.99 frames. ], batch size: 548, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:49:34,747 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. limit=6.0 2023-06-26 11:49:36,730 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.419e+02 4.948e+02 6.287e+02 8.709e+02 1.538e+03, threshold=1.257e+03, percent-clipped=9.0 2023-06-26 11:49:44,339 INFO [train.py:996] (2/4) Epoch 9, batch 16500, loss[loss=0.1873, simple_loss=0.2574, pruned_loss=0.05862, over 21668.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2911, pruned_loss=0.06952, over 4280775.72 frames. ], batch size: 263, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:50:17,071 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.14 vs. limit=22.5 2023-06-26 11:51:10,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1562922.0, ans=0.0 2023-06-26 11:51:34,694 INFO [train.py:996] (2/4) Epoch 9, batch 16550, loss[loss=0.2054, simple_loss=0.2695, pruned_loss=0.0706, over 21359.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2897, pruned_loss=0.06762, over 4278517.02 frames. ], batch size: 159, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:52:28,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1563162.0, ans=0.0 2023-06-26 11:52:46,790 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.02 vs. 
limit=10.0 2023-06-26 11:53:20,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1563282.0, ans=0.1 2023-06-26 11:53:20,647 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.34 vs. limit=15.0 2023-06-26 11:53:22,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1563282.0, ans=0.125 2023-06-26 11:53:24,963 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.329e+02 6.129e+02 9.986e+02 1.624e+03 3.562e+03, threshold=1.997e+03, percent-clipped=34.0 2023-06-26 11:53:31,926 INFO [train.py:996] (2/4) Epoch 9, batch 16600, loss[loss=0.2573, simple_loss=0.3682, pruned_loss=0.07323, over 21773.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2964, pruned_loss=0.07031, over 4275461.61 frames. ], batch size: 332, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:53:34,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1563342.0, ans=0.0 2023-06-26 11:53:50,882 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.72 vs. limit=10.0 2023-06-26 11:54:11,257 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.83 vs. limit=15.0 2023-06-26 11:55:02,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1563522.0, ans=0.95 2023-06-26 11:55:03,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1563582.0, ans=0.125 2023-06-26 11:55:29,123 INFO [train.py:996] (2/4) Epoch 9, batch 16650, loss[loss=0.266, simple_loss=0.346, pruned_loss=0.09296, over 21448.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.306, pruned_loss=0.07234, over 4275402.87 frames. ], batch size: 131, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:55:47,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1563642.0, ans=0.0 2023-06-26 11:56:50,782 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.42 vs. limit=22.5 2023-06-26 11:56:51,935 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 11:57:19,515 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.13 vs. limit=15.0 2023-06-26 11:57:21,322 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.537e+02 4.947e+02 6.891e+02 9.517e+02 1.890e+03, threshold=1.378e+03, percent-clipped=0.0 2023-06-26 11:57:33,714 INFO [train.py:996] (2/4) Epoch 9, batch 16700, loss[loss=0.2157, simple_loss=0.2935, pruned_loss=0.06891, over 20668.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3056, pruned_loss=0.07283, over 4274386.45 frames. 
], batch size: 607, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:58:26,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1564062.0, ans=0.0 2023-06-26 11:59:29,001 INFO [train.py:996] (2/4) Epoch 9, batch 16750, loss[loss=0.3327, simple_loss=0.4096, pruned_loss=0.1279, over 21441.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3086, pruned_loss=0.07542, over 4271319.98 frames. ], batch size: 507, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:59:39,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1564242.0, ans=0.125 2023-06-26 12:00:29,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1564362.0, ans=0.1 2023-06-26 12:01:05,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1564482.0, ans=0.0 2023-06-26 12:01:13,733 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.747e+02 5.518e+02 7.489e+02 1.102e+03 1.868e+03, threshold=1.498e+03, percent-clipped=9.0 2023-06-26 12:01:20,317 INFO [train.py:996] (2/4) Epoch 9, batch 16800, loss[loss=0.2275, simple_loss=0.3023, pruned_loss=0.07633, over 21803.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3137, pruned_loss=0.07559, over 4266703.84 frames. ], batch size: 112, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 12:01:36,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1564542.0, ans=0.125 2023-06-26 12:02:00,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1564602.0, ans=0.2 2023-06-26 12:03:09,554 INFO [train.py:996] (2/4) Epoch 9, batch 16850, loss[loss=0.2329, simple_loss=0.3687, pruned_loss=0.04859, over 20799.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3105, pruned_loss=0.07529, over 4276524.42 frames. ], batch size: 607, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 12:03:19,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1564842.0, ans=0.0 2023-06-26 12:03:21,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1564842.0, ans=0.125 2023-06-26 12:04:34,658 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.36 vs. limit=15.0 2023-06-26 12:04:52,110 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.430e+02 5.138e+02 7.609e+02 1.062e+03 2.399e+03, threshold=1.522e+03, percent-clipped=7.0 2023-06-26 12:05:01,940 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.38 vs. limit=12.0 2023-06-26 12:05:02,266 INFO [train.py:996] (2/4) Epoch 9, batch 16900, loss[loss=0.175, simple_loss=0.2581, pruned_loss=0.04597, over 21609.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3049, pruned_loss=0.07301, over 4275825.12 frames. 
], batch size: 230, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:05:45,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1565262.0, ans=0.1 2023-06-26 12:06:07,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1565322.0, ans=0.125 2023-06-26 12:06:11,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1565322.0, ans=0.0 2023-06-26 12:06:11,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1565322.0, ans=0.1 2023-06-26 12:06:32,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1565382.0, ans=0.1 2023-06-26 12:06:43,803 INFO [train.py:996] (2/4) Epoch 9, batch 16950, loss[loss=0.2011, simple_loss=0.2702, pruned_loss=0.06596, over 21687.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2977, pruned_loss=0.07127, over 4279795.31 frames. ], batch size: 230, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:07:07,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1565442.0, ans=0.2 2023-06-26 12:07:48,083 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.36 vs. limit=15.0 2023-06-26 12:08:27,255 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.850e+02 5.163e+02 6.810e+02 8.799e+02 2.047e+03, threshold=1.362e+03, percent-clipped=3.0 2023-06-26 12:08:28,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1565682.0, ans=0.0 2023-06-26 12:08:28,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1565682.0, ans=0.125 2023-06-26 12:08:32,658 INFO [train.py:996] (2/4) Epoch 9, batch 17000, loss[loss=0.2211, simple_loss=0.2924, pruned_loss=0.07491, over 21847.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.295, pruned_loss=0.07204, over 4288696.29 frames. ], batch size: 124, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:08:53,562 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 12:08:55,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1565742.0, ans=0.125 2023-06-26 12:09:10,242 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.23 vs. limit=15.0 2023-06-26 12:09:22,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1565802.0, ans=0.025 2023-06-26 12:09:32,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1565862.0, ans=0.125 2023-06-26 12:09:42,296 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.90 vs. 
limit=6.0 2023-06-26 12:09:44,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1565862.0, ans=0.125 2023-06-26 12:09:44,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1565862.0, ans=0.125 2023-06-26 12:09:46,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1565922.0, ans=0.1 2023-06-26 12:10:27,409 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=12.0 2023-06-26 12:10:29,847 INFO [train.py:996] (2/4) Epoch 9, batch 17050, loss[loss=0.2075, simple_loss=0.2649, pruned_loss=0.07504, over 20274.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3023, pruned_loss=0.0742, over 4284521.71 frames. ], batch size: 703, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:10:49,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1566042.0, ans=0.0 2023-06-26 12:11:05,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1566102.0, ans=0.1 2023-06-26 12:11:07,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1566102.0, ans=0.125 2023-06-26 12:11:09,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1566102.0, ans=0.1 2023-06-26 12:11:26,041 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.12 vs. limit=10.0 2023-06-26 12:11:55,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1566282.0, ans=0.0 2023-06-26 12:11:59,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1566282.0, ans=0.125 2023-06-26 12:12:06,990 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.631e+02 5.718e+02 8.769e+02 1.372e+03 2.605e+03, threshold=1.754e+03, percent-clipped=26.0 2023-06-26 12:12:17,827 INFO [train.py:996] (2/4) Epoch 9, batch 17100, loss[loss=0.2217, simple_loss=0.2937, pruned_loss=0.07491, over 21947.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3003, pruned_loss=0.07443, over 4288872.77 frames. ], batch size: 351, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:12:31,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1566342.0, ans=0.125 2023-06-26 12:12:37,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1566342.0, ans=0.125 2023-06-26 12:13:31,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1566522.0, ans=0.1 2023-06-26 12:13:46,742 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-06-26 12:14:10,889 INFO [train.py:996] (2/4) Epoch 9, batch 17150, loss[loss=0.2053, simple_loss=0.2764, pruned_loss=0.06708, over 21573.00 frames. 
], tot_loss[loss=0.223, simple_loss=0.2975, pruned_loss=0.07426, over 4291534.93 frames. ], batch size: 212, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:14:25,719 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 12:14:43,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1566702.0, ans=0.125 2023-06-26 12:15:03,656 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.81 vs. limit=15.0 2023-06-26 12:15:36,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1566882.0, ans=0.125 2023-06-26 12:15:45,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1566882.0, ans=0.2 2023-06-26 12:15:55,037 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.574e+02 4.824e+02 6.813e+02 1.101e+03 2.342e+03, threshold=1.363e+03, percent-clipped=2.0 2023-06-26 12:16:00,504 INFO [train.py:996] (2/4) Epoch 9, batch 17200, loss[loss=0.2147, simple_loss=0.2928, pruned_loss=0.06828, over 21720.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2983, pruned_loss=0.07386, over 4292544.07 frames. ], batch size: 332, lr: 3.27e-03, grad_scale: 32.0 2023-06-26 12:16:38,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1567002.0, ans=0.125 2023-06-26 12:16:45,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1567062.0, ans=0.04949747468305833 2023-06-26 12:17:03,946 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=15.0 2023-06-26 12:17:43,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1567182.0, ans=0.1 2023-06-26 12:18:02,446 INFO [train.py:996] (2/4) Epoch 9, batch 17250, loss[loss=0.2432, simple_loss=0.3324, pruned_loss=0.07696, over 21320.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.301, pruned_loss=0.07535, over 4293883.16 frames. 
], batch size: 159, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:18:20,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1567302.0, ans=0.035 2023-06-26 12:18:22,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1567302.0, ans=0.125 2023-06-26 12:18:35,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1567302.0, ans=0.0 2023-06-26 12:19:19,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1567422.0, ans=0.0 2023-06-26 12:19:43,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1567482.0, ans=0.125 2023-06-26 12:19:45,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1567482.0, ans=0.09899494936611666 2023-06-26 12:19:48,794 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.788e+02 5.514e+02 7.810e+02 1.291e+03 2.321e+03, threshold=1.562e+03, percent-clipped=17.0 2023-06-26 12:19:52,282 INFO [train.py:996] (2/4) Epoch 9, batch 17300, loss[loss=0.2489, simple_loss=0.3278, pruned_loss=0.08498, over 21645.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3074, pruned_loss=0.07739, over 4287142.66 frames. ], batch size: 389, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:19:57,200 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.10 vs. limit=15.0 2023-06-26 12:20:26,589 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.77 vs. limit=15.0 2023-06-26 12:20:36,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1567662.0, ans=0.125 2023-06-26 12:21:14,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1567782.0, ans=0.07 2023-06-26 12:21:27,410 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.34 vs. limit=22.5 2023-06-26 12:21:38,668 INFO [train.py:996] (2/4) Epoch 9, batch 17350, loss[loss=0.1913, simple_loss=0.2852, pruned_loss=0.04873, over 21788.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.309, pruned_loss=0.07742, over 4287109.87 frames. ], batch size: 316, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:21:41,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1567842.0, ans=0.125 2023-06-26 12:22:24,908 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.76 vs. limit=22.5 2023-06-26 12:22:44,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1568022.0, ans=0.025 2023-06-26 12:22:56,059 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.13 vs. 
limit=15.0 2023-06-26 12:23:15,915 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.405e+02 5.455e+02 8.630e+02 1.274e+03 2.528e+03, threshold=1.726e+03, percent-clipped=16.0 2023-06-26 12:23:19,222 INFO [train.py:996] (2/4) Epoch 9, batch 17400, loss[loss=0.2209, simple_loss=0.2947, pruned_loss=0.07355, over 20801.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3066, pruned_loss=0.07482, over 4288218.27 frames. ], batch size: 611, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:24:43,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1568322.0, ans=0.0 2023-06-26 12:24:51,123 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.87 vs. limit=15.0 2023-06-26 12:24:57,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1568382.0, ans=0.0 2023-06-26 12:25:00,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1568382.0, ans=0.125 2023-06-26 12:25:10,963 INFO [train.py:996] (2/4) Epoch 9, batch 17450, loss[loss=0.1815, simple_loss=0.2579, pruned_loss=0.05255, over 21210.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3028, pruned_loss=0.07189, over 4282896.72 frames. ], batch size: 176, lr: 3.27e-03, grad_scale: 8.0 2023-06-26 12:25:24,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1568442.0, ans=0.0 2023-06-26 12:25:34,223 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.10 vs. limit=10.0 2023-06-26 12:26:01,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1568562.0, ans=0.125 2023-06-26 12:26:14,503 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.08 vs. limit=15.0 2023-06-26 12:26:42,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1568682.0, ans=0.2 2023-06-26 12:26:57,100 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.248e+02 4.686e+02 6.725e+02 1.029e+03 2.928e+03, threshold=1.345e+03, percent-clipped=7.0 2023-06-26 12:26:58,682 INFO [train.py:996] (2/4) Epoch 9, batch 17500, loss[loss=0.2239, simple_loss=0.2958, pruned_loss=0.07597, over 21670.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.298, pruned_loss=0.06951, over 4291338.65 frames. ], batch size: 441, lr: 3.27e-03, grad_scale: 8.0 2023-06-26 12:27:06,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1568742.0, ans=0.125 2023-06-26 12:27:54,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1568862.0, ans=0.125 2023-06-26 12:28:10,646 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=16.15 vs. 
limit=15.0 2023-06-26 12:28:13,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1568922.0, ans=0.1 2023-06-26 12:28:14,071 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=15.0 2023-06-26 12:28:25,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1568982.0, ans=0.125 2023-06-26 12:28:41,025 INFO [train.py:996] (2/4) Epoch 9, batch 17550, loss[loss=0.2009, simple_loss=0.2945, pruned_loss=0.05362, over 21634.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2984, pruned_loss=0.06883, over 4282468.14 frames. ], batch size: 230, lr: 3.27e-03, grad_scale: 8.0 2023-06-26 12:28:54,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1569042.0, ans=0.0 2023-06-26 12:29:09,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1569102.0, ans=0.2 2023-06-26 12:29:13,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1569102.0, ans=0.125 2023-06-26 12:30:05,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1569222.0, ans=0.2 2023-06-26 12:30:30,330 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=12.0 2023-06-26 12:30:33,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1569282.0, ans=0.0 2023-06-26 12:30:34,075 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.446e+02 4.559e+02 6.477e+02 8.639e+02 1.603e+03, threshold=1.295e+03, percent-clipped=2.0 2023-06-26 12:30:35,798 INFO [train.py:996] (2/4) Epoch 9, batch 17600, loss[loss=0.2264, simple_loss=0.3125, pruned_loss=0.07018, over 21919.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.3009, pruned_loss=0.06986, over 4280123.54 frames. ], batch size: 372, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:30:38,698 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0 2023-06-26 12:30:43,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1569342.0, ans=0.0 2023-06-26 12:31:07,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1569402.0, ans=0.125 2023-06-26 12:31:10,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1569402.0, ans=0.125 2023-06-26 12:31:12,978 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.00 vs. 
limit=15.0 2023-06-26 12:31:14,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1569402.0, ans=0.125 2023-06-26 12:31:48,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1569522.0, ans=0.1 2023-06-26 12:32:20,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1569642.0, ans=0.0 2023-06-26 12:32:21,705 INFO [train.py:996] (2/4) Epoch 9, batch 17650, loss[loss=0.2113, simple_loss=0.2967, pruned_loss=0.06293, over 21463.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2987, pruned_loss=0.0701, over 4283373.88 frames. ], batch size: 471, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:33:11,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1569762.0, ans=0.1 2023-06-26 12:33:49,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1569822.0, ans=0.125 2023-06-26 12:34:09,282 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.770e+02 6.180e+02 8.586e+02 1.472e+03 2.723e+03, threshold=1.717e+03, percent-clipped=31.0 2023-06-26 12:34:10,910 INFO [train.py:996] (2/4) Epoch 9, batch 17700, loss[loss=0.2471, simple_loss=0.3293, pruned_loss=0.08247, over 21344.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2947, pruned_loss=0.06791, over 4277479.60 frames. ], batch size: 549, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:35:06,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1570062.0, ans=0.0 2023-06-26 12:35:16,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1570062.0, ans=0.0 2023-06-26 12:35:55,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1570182.0, ans=0.1 2023-06-26 12:36:06,988 INFO [train.py:996] (2/4) Epoch 9, batch 17750, loss[loss=0.2252, simple_loss=0.3007, pruned_loss=0.07488, over 21823.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3013, pruned_loss=0.07099, over 4279353.69 frames. ], batch size: 282, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:36:11,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1570242.0, ans=0.1 2023-06-26 12:36:41,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1570302.0, ans=0.125 2023-06-26 12:36:48,675 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.92 vs. 
limit=15.0 2023-06-26 12:36:50,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1570362.0, ans=0.2 2023-06-26 12:37:46,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1570482.0, ans=0.0 2023-06-26 12:37:54,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1570482.0, ans=0.2 2023-06-26 12:37:56,888 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.878e+02 5.343e+02 8.043e+02 1.136e+03 2.008e+03, threshold=1.609e+03, percent-clipped=5.0 2023-06-26 12:38:04,117 INFO [train.py:996] (2/4) Epoch 9, batch 17800, loss[loss=0.2703, simple_loss=0.346, pruned_loss=0.09736, over 21467.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3011, pruned_loss=0.0699, over 4271861.06 frames. ], batch size: 471, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:39:05,015 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.78 vs. limit=15.0 2023-06-26 12:39:47,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1570782.0, ans=0.125 2023-06-26 12:39:55,311 INFO [train.py:996] (2/4) Epoch 9, batch 17850, loss[loss=0.2588, simple_loss=0.3392, pruned_loss=0.08917, over 21594.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3034, pruned_loss=0.07066, over 4269730.49 frames. ], batch size: 414, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:40:21,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1570902.0, ans=0.0 2023-06-26 12:40:39,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1570962.0, ans=0.125 2023-06-26 12:41:42,266 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.429e+02 5.491e+02 8.059e+02 1.156e+03 1.916e+03, threshold=1.612e+03, percent-clipped=10.0 2023-06-26 12:41:43,905 INFO [train.py:996] (2/4) Epoch 9, batch 17900, loss[loss=0.2113, simple_loss=0.2997, pruned_loss=0.0615, over 21305.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3075, pruned_loss=0.07236, over 4267866.60 frames. ], batch size: 176, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:42:21,738 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.90 vs. limit=22.5 2023-06-26 12:42:34,271 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.42 vs. limit=15.0 2023-06-26 12:42:40,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1571262.0, ans=0.0 2023-06-26 12:43:40,944 INFO [train.py:996] (2/4) Epoch 9, batch 17950, loss[loss=0.181, simple_loss=0.2727, pruned_loss=0.04462, over 21679.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3072, pruned_loss=0.07004, over 4269726.53 frames. ], batch size: 247, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:43:43,605 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.46 vs. 
limit=15.0 2023-06-26 12:44:08,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1571502.0, ans=0.0 2023-06-26 12:44:25,674 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.20 vs. limit=15.0 2023-06-26 12:44:49,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1571622.0, ans=0.0 2023-06-26 12:45:24,824 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.199e+02 4.426e+02 5.727e+02 7.254e+02 1.857e+03, threshold=1.145e+03, percent-clipped=1.0 2023-06-26 12:45:26,478 INFO [train.py:996] (2/4) Epoch 9, batch 18000, loss[loss=0.1997, simple_loss=0.2756, pruned_loss=0.0619, over 21331.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.3001, pruned_loss=0.06807, over 4263471.73 frames. ], batch size: 131, lr: 3.27e-03, grad_scale: 32.0 2023-06-26 12:45:26,478 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-26 12:45:46,681 INFO [train.py:1028] (2/4) Epoch 9, validation: loss=0.2587, simple_loss=0.3543, pruned_loss=0.08153, over 1796401.00 frames. 2023-06-26 12:45:46,682 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-26 12:46:00,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1571742.0, ans=0.125 2023-06-26 12:46:21,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1571802.0, ans=0.125 2023-06-26 12:46:26,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1571862.0, ans=0.125 2023-06-26 12:47:36,557 INFO [train.py:996] (2/4) Epoch 9, batch 18050, loss[loss=0.2024, simple_loss=0.2761, pruned_loss=0.06434, over 21839.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2925, pruned_loss=0.06717, over 4260443.22 frames. ], batch size: 372, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:47:46,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1572042.0, ans=0.125 2023-06-26 12:47:53,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1572102.0, ans=0.125 2023-06-26 12:48:18,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1572162.0, ans=0.1 2023-06-26 12:48:21,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1572162.0, ans=0.0 2023-06-26 12:49:28,429 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.634e+02 5.417e+02 6.596e+02 1.071e+03 2.802e+03, threshold=1.319e+03, percent-clipped=21.0 2023-06-26 12:49:28,467 INFO [train.py:996] (2/4) Epoch 9, batch 18100, loss[loss=0.2117, simple_loss=0.3124, pruned_loss=0.05549, over 20696.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2975, pruned_loss=0.06959, over 4262352.03 frames. 
], batch size: 607, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:49:34,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1572342.0, ans=0.0 2023-06-26 12:51:08,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1572582.0, ans=0.025 2023-06-26 12:51:18,354 INFO [train.py:996] (2/4) Epoch 9, batch 18150, loss[loss=0.1938, simple_loss=0.2673, pruned_loss=0.06016, over 21311.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2996, pruned_loss=0.06983, over 4253030.02 frames. ], batch size: 144, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:51:31,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1572642.0, ans=0.0 2023-06-26 12:52:04,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1572762.0, ans=0.125 2023-06-26 12:52:20,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1572762.0, ans=0.0 2023-06-26 12:52:54,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1572882.0, ans=0.125 2023-06-26 12:53:05,713 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.440e+02 4.565e+02 5.741e+02 8.756e+02 1.817e+03, threshold=1.148e+03, percent-clipped=4.0 2023-06-26 12:53:05,763 INFO [train.py:996] (2/4) Epoch 9, batch 18200, loss[loss=0.2126, simple_loss=0.2789, pruned_loss=0.07312, over 21728.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2938, pruned_loss=0.06943, over 4248725.22 frames. ], batch size: 112, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:53:17,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1572942.0, ans=0.125 2023-06-26 12:53:19,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1572942.0, ans=0.0 2023-06-26 12:54:47,262 INFO [train.py:996] (2/4) Epoch 9, batch 18250, loss[loss=0.2451, simple_loss=0.3098, pruned_loss=0.09016, over 21764.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2866, pruned_loss=0.0674, over 4255035.73 frames. ], batch size: 441, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:55:04,912 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 12:55:08,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1573302.0, ans=0.0 2023-06-26 12:55:45,472 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=15.0 2023-06-26 12:56:10,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1573422.0, ans=0.2 2023-06-26 12:56:42,094 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.384e+02 4.816e+02 6.355e+02 8.859e+02 2.523e+03, threshold=1.271e+03, percent-clipped=14.0 2023-06-26 12:56:42,125 INFO [train.py:996] (2/4) Epoch 9, batch 18300, loss[loss=0.2306, simple_loss=0.3344, pruned_loss=0.06337, over 21754.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2874, pruned_loss=0.06742, over 4259126.60 frames. 
], batch size: 298, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:57:05,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1573602.0, ans=0.0 2023-06-26 12:57:24,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1573662.0, ans=0.125 2023-06-26 12:57:36,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1573662.0, ans=0.125 2023-06-26 12:58:25,484 INFO [train.py:996] (2/4) Epoch 9, batch 18350, loss[loss=0.1857, simple_loss=0.2611, pruned_loss=0.05511, over 21585.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2917, pruned_loss=0.06716, over 4251365.93 frames. ], batch size: 263, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:58:43,568 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 12:59:01,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1573902.0, ans=0.5 2023-06-26 12:59:53,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1574082.0, ans=0.125 2023-06-26 13:00:14,629 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.067e+02 5.416e+02 7.058e+02 9.535e+02 2.465e+03, threshold=1.412e+03, percent-clipped=12.0 2023-06-26 13:00:14,666 INFO [train.py:996] (2/4) Epoch 9, batch 18400, loss[loss=0.1952, simple_loss=0.2762, pruned_loss=0.05709, over 21211.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2881, pruned_loss=0.06606, over 4255799.95 frames. ], batch size: 176, lr: 3.27e-03, grad_scale: 32.0 2023-06-26 13:00:39,586 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=15.0 2023-06-26 13:01:14,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1574262.0, ans=0.125 2023-06-26 13:01:16,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1574262.0, ans=0.1 2023-06-26 13:01:54,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1574382.0, ans=0.0 2023-06-26 13:02:04,291 INFO [train.py:996] (2/4) Epoch 9, batch 18450, loss[loss=0.1689, simple_loss=0.243, pruned_loss=0.04743, over 21182.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2856, pruned_loss=0.06297, over 4250863.73 frames. ], batch size: 143, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 13:03:19,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1574622.0, ans=0.0 2023-06-26 13:03:28,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1574682.0, ans=0.2 2023-06-26 13:03:52,199 INFO [train.py:996] (2/4) Epoch 9, batch 18500, loss[loss=0.1736, simple_loss=0.2594, pruned_loss=0.04391, over 21555.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2801, pruned_loss=0.06165, over 4259687.59 frames. 
], batch size: 230, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:03:53,905 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.096e+02 4.756e+02 7.398e+02 1.037e+03 4.377e+03, threshold=1.480e+03, percent-clipped=11.0 2023-06-26 13:03:56,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1574742.0, ans=0.2 2023-06-26 13:03:56,820 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.92 vs. limit=10.0 2023-06-26 13:04:03,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1574742.0, ans=0.0 2023-06-26 13:04:09,277 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=15.0 2023-06-26 13:05:39,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1575042.0, ans=0.2 2023-06-26 13:05:40,075 INFO [train.py:996] (2/4) Epoch 9, batch 18550, loss[loss=0.1816, simple_loss=0.275, pruned_loss=0.04412, over 21786.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2769, pruned_loss=0.06048, over 4251361.41 frames. ], batch size: 282, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:06:03,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1575102.0, ans=0.04949747468305833 2023-06-26 13:06:14,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1575102.0, ans=0.125 2023-06-26 13:06:51,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1575222.0, ans=0.125 2023-06-26 13:06:54,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1575222.0, ans=0.09899494936611666 2023-06-26 13:07:19,535 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=15.0 2023-06-26 13:07:28,389 INFO [train.py:996] (2/4) Epoch 9, batch 18600, loss[loss=0.1828, simple_loss=0.266, pruned_loss=0.04978, over 21629.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2758, pruned_loss=0.06185, over 4257362.95 frames. ], batch size: 263, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:07:30,259 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.128e+02 4.633e+02 7.387e+02 1.048e+03 1.831e+03, threshold=1.477e+03, percent-clipped=1.0 2023-06-26 13:08:24,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1575462.0, ans=0.125 2023-06-26 13:08:39,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1575522.0, ans=0.125 2023-06-26 13:09:05,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1575582.0, ans=0.0 2023-06-26 13:09:15,097 INFO [train.py:996] (2/4) Epoch 9, batch 18650, loss[loss=0.1826, simple_loss=0.2518, pruned_loss=0.05671, over 21584.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2755, pruned_loss=0.06171, over 4257743.28 frames. 
], batch size: 230, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:09:35,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1575702.0, ans=0.125 2023-06-26 13:09:56,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1575762.0, ans=0.0 2023-06-26 13:09:56,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1575762.0, ans=0.0 2023-06-26 13:10:49,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1575882.0, ans=0.125 2023-06-26 13:11:00,265 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=12.0 2023-06-26 13:11:02,378 INFO [train.py:996] (2/4) Epoch 9, batch 18700, loss[loss=0.2253, simple_loss=0.2871, pruned_loss=0.0818, over 21190.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2746, pruned_loss=0.06327, over 4254593.39 frames. ], batch size: 143, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:11:04,043 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.187e+02 4.395e+02 5.926e+02 8.949e+02 1.374e+03, threshold=1.185e+03, percent-clipped=0.0 2023-06-26 13:11:04,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1575942.0, ans=0.0 2023-06-26 13:11:31,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1576002.0, ans=0.0 2023-06-26 13:11:32,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1576002.0, ans=0.0 2023-06-26 13:11:54,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1576062.0, ans=0.0 2023-06-26 13:12:49,679 INFO [train.py:996] (2/4) Epoch 9, batch 18750, loss[loss=0.196, simple_loss=0.2448, pruned_loss=0.07354, over 20306.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2763, pruned_loss=0.06572, over 4240313.48 frames. 
], batch size: 703, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:13:16,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1576302.0, ans=0.125 2023-06-26 13:13:16,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1576302.0, ans=0.0 2023-06-26 13:13:23,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1576362.0, ans=0.1 2023-06-26 13:13:49,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1576422.0, ans=0.0 2023-06-26 13:14:03,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1576422.0, ans=0.125 2023-06-26 13:14:34,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1576482.0, ans=0.1 2023-06-26 13:14:34,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1576482.0, ans=0.125 2023-06-26 13:14:38,317 INFO [train.py:996] (2/4) Epoch 9, batch 18800, loss[loss=0.1915, simple_loss=0.2469, pruned_loss=0.068, over 20266.00 frames. ], tot_loss[loss=0.208, simple_loss=0.282, pruned_loss=0.06695, over 4239957.89 frames. ], batch size: 703, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:14:40,099 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.559e+02 6.038e+02 7.723e+02 1.097e+03 3.023e+03, threshold=1.545e+03, percent-clipped=19.0 2023-06-26 13:14:42,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1576542.0, ans=0.2 2023-06-26 13:14:48,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1576542.0, ans=0.2 2023-06-26 13:16:07,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1576722.0, ans=0.125 2023-06-26 13:16:27,814 INFO [train.py:996] (2/4) Epoch 9, batch 18850, loss[loss=0.1604, simple_loss=0.2413, pruned_loss=0.03971, over 21242.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2787, pruned_loss=0.06281, over 4241512.10 frames. ], batch size: 176, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:16:28,804 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=12.0 2023-06-26 13:16:51,121 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.77 vs. limit=10.0 2023-06-26 13:18:05,145 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.75 vs. limit=15.0 2023-06-26 13:18:06,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1577082.0, ans=0.125 2023-06-26 13:18:08,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1577082.0, ans=0.125 2023-06-26 13:18:14,393 INFO [train.py:996] (2/4) Epoch 9, batch 18900, loss[loss=0.2447, simple_loss=0.2916, pruned_loss=0.09891, over 21722.00 frames. 
], tot_loss[loss=0.2007, simple_loss=0.2755, pruned_loss=0.06298, over 4249670.27 frames. ], batch size: 511, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:18:17,656 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.228e+02 4.531e+02 6.963e+02 9.490e+02 1.932e+03, threshold=1.393e+03, percent-clipped=3.0 2023-06-26 13:18:43,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1577202.0, ans=0.05 2023-06-26 13:18:55,531 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=12.0 2023-06-26 13:19:10,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1577322.0, ans=10.0 2023-06-26 13:19:16,464 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=22.5 2023-06-26 13:19:33,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1577322.0, ans=0.125 2023-06-26 13:19:45,541 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0 2023-06-26 13:19:51,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1577382.0, ans=0.2 2023-06-26 13:20:03,827 INFO [train.py:996] (2/4) Epoch 9, batch 18950, loss[loss=0.2804, simple_loss=0.3572, pruned_loss=0.1018, over 21727.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2773, pruned_loss=0.06537, over 4267602.22 frames. ], batch size: 511, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:20:04,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1577442.0, ans=0.125 2023-06-26 13:21:30,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1577622.0, ans=0.125 2023-06-26 13:21:42,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1577682.0, ans=0.1 2023-06-26 13:21:54,033 INFO [train.py:996] (2/4) Epoch 9, batch 19000, loss[loss=0.2521, simple_loss=0.3254, pruned_loss=0.08943, over 21295.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2864, pruned_loss=0.06671, over 4270895.54 frames. ], batch size: 143, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:21:58,082 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.501e+02 4.865e+02 6.670e+02 8.887e+02 1.787e+03, threshold=1.334e+03, percent-clipped=6.0 2023-06-26 13:22:00,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1577742.0, ans=0.2 2023-06-26 13:22:07,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1577742.0, ans=0.07 2023-06-26 13:22:09,286 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.28 vs. 
limit=15.0 2023-06-26 13:22:25,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1577802.0, ans=0.1 2023-06-26 13:23:37,528 INFO [train.py:996] (2/4) Epoch 9, batch 19050, loss[loss=0.2322, simple_loss=0.3064, pruned_loss=0.07903, over 21657.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2913, pruned_loss=0.07068, over 4279047.91 frames. ], batch size: 389, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:23:47,114 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.39 vs. limit=15.0 2023-06-26 13:24:12,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1578102.0, ans=0.125 2023-06-26 13:25:20,509 INFO [train.py:996] (2/4) Epoch 9, batch 19100, loss[loss=0.1962, simple_loss=0.2654, pruned_loss=0.06349, over 21435.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2911, pruned_loss=0.07207, over 4285882.26 frames. ], batch size: 195, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:25:24,158 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.781e+02 5.304e+02 7.054e+02 1.099e+03 1.877e+03, threshold=1.411e+03, percent-clipped=10.0 2023-06-26 13:26:31,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1578462.0, ans=0.125 2023-06-26 13:26:37,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1578522.0, ans=0.125 2023-06-26 13:26:42,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1578522.0, ans=0.0 2023-06-26 13:26:46,770 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-26 13:26:49,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1578522.0, ans=0.125 2023-06-26 13:27:11,396 INFO [train.py:996] (2/4) Epoch 9, batch 19150, loss[loss=0.2283, simple_loss=0.2934, pruned_loss=0.08161, over 19887.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2931, pruned_loss=0.07242, over 4285991.62 frames. ], batch size: 702, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:28:14,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1578762.0, ans=0.125 2023-06-26 13:29:06,074 INFO [train.py:996] (2/4) Epoch 9, batch 19200, loss[loss=0.1872, simple_loss=0.2735, pruned_loss=0.05043, over 21268.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3002, pruned_loss=0.07297, over 4278368.21 frames. ], batch size: 159, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:29:10,044 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.893e+02 6.153e+02 9.835e+02 1.321e+03 2.570e+03, threshold=1.967e+03, percent-clipped=19.0 2023-06-26 13:29:39,106 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.10 vs. 
limit=15.0 2023-06-26 13:29:50,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1579002.0, ans=0.125 2023-06-26 13:29:57,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1579062.0, ans=0.2 2023-06-26 13:30:28,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1579182.0, ans=0.125 2023-06-26 13:30:35,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1579182.0, ans=0.0 2023-06-26 13:30:49,831 INFO [train.py:996] (2/4) Epoch 9, batch 19250, loss[loss=0.2131, simple_loss=0.2912, pruned_loss=0.06751, over 21855.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2999, pruned_loss=0.06829, over 4279980.31 frames. ], batch size: 107, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:31:58,853 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0 2023-06-26 13:32:01,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1579422.0, ans=0.0 2023-06-26 13:32:38,034 INFO [train.py:996] (2/4) Epoch 9, batch 19300, loss[loss=0.1961, simple_loss=0.2741, pruned_loss=0.05906, over 21878.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.297, pruned_loss=0.06794, over 4277406.90 frames. ], batch size: 351, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:32:40,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1579542.0, ans=0.125 2023-06-26 13:32:41,543 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.842e+02 4.708e+02 6.632e+02 9.817e+02 2.132e+03, threshold=1.326e+03, percent-clipped=1.0 2023-06-26 13:32:54,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1579602.0, ans=0.025 2023-06-26 13:33:21,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1579602.0, ans=0.125 2023-06-26 13:33:30,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1579662.0, ans=0.5 2023-06-26 13:34:23,277 INFO [train.py:996] (2/4) Epoch 9, batch 19350, loss[loss=0.1772, simple_loss=0.2681, pruned_loss=0.04308, over 21711.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2934, pruned_loss=0.06427, over 4277937.96 frames. ], batch size: 351, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:34:43,690 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=15.0 2023-06-26 13:35:08,171 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.04 vs. limit=15.0 2023-06-26 13:35:24,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1579962.0, ans=0.1 2023-06-26 13:35:27,212 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.12 vs. 
limit=12.0 2023-06-26 13:35:28,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1579962.0, ans=0.0 2023-06-26 13:35:32,452 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=15.0 2023-06-26 13:35:44,760 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=12.0 2023-06-26 13:35:45,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1580022.0, ans=0.07 2023-06-26 13:36:09,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1580142.0, ans=0.1 2023-06-26 13:36:10,345 INFO [train.py:996] (2/4) Epoch 9, batch 19400, loss[loss=0.2252, simple_loss=0.3029, pruned_loss=0.07369, over 21817.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2902, pruned_loss=0.06323, over 4280025.74 frames. ], batch size: 391, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:36:15,968 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.056e+02 5.043e+02 7.685e+02 1.074e+03 1.940e+03, threshold=1.537e+03, percent-clipped=16.0 2023-06-26 13:37:04,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1580262.0, ans=0.2 2023-06-26 13:37:23,654 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=22.5 2023-06-26 13:37:24,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1580322.0, ans=0.125 2023-06-26 13:37:31,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1580382.0, ans=0.2 2023-06-26 13:37:33,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1580382.0, ans=0.125 2023-06-26 13:37:33,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1580382.0, ans=0.0 2023-06-26 13:37:53,589 INFO [train.py:996] (2/4) Epoch 9, batch 19450, loss[loss=0.2528, simple_loss=0.3289, pruned_loss=0.08836, over 14853.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2875, pruned_loss=0.06451, over 4273586.39 frames. 
], batch size: 60, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:37:54,319 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 13:38:01,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1580442.0, ans=0.04949747468305833 2023-06-26 13:38:10,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1580442.0, ans=0.0 2023-06-26 13:38:10,994 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 13:38:11,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1580442.0, ans=0.125 2023-06-26 13:39:03,870 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.27 vs. limit=22.5 2023-06-26 13:39:35,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1580682.0, ans=0.1 2023-06-26 13:39:36,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1580682.0, ans=0.1 2023-06-26 13:39:41,553 INFO [train.py:996] (2/4) Epoch 9, batch 19500, loss[loss=0.1871, simple_loss=0.243, pruned_loss=0.06562, over 21108.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2842, pruned_loss=0.06564, over 4263435.40 frames. ], batch size: 143, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:39:46,900 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.457e+02 4.487e+02 6.079e+02 9.287e+02 2.149e+03, threshold=1.216e+03, percent-clipped=7.0 2023-06-26 13:40:25,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1580802.0, ans=0.04949747468305833 2023-06-26 13:41:31,330 INFO [train.py:996] (2/4) Epoch 9, batch 19550, loss[loss=0.1701, simple_loss=0.2491, pruned_loss=0.04556, over 21199.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2791, pruned_loss=0.06434, over 4250055.56 frames. ], batch size: 159, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:41:31,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1581042.0, ans=0.125 2023-06-26 13:41:35,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1581042.0, ans=0.1 2023-06-26 13:41:44,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1581042.0, ans=0.05 2023-06-26 13:41:46,276 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.11 vs. 
limit=15.0 2023-06-26 13:42:23,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1581162.0, ans=0.125 2023-06-26 13:42:26,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1581162.0, ans=0.125 2023-06-26 13:42:28,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1581162.0, ans=0.125 2023-06-26 13:42:37,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1581222.0, ans=0.0 2023-06-26 13:43:03,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1581282.0, ans=0.1 2023-06-26 13:43:17,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1581342.0, ans=0.125 2023-06-26 13:43:18,841 INFO [train.py:996] (2/4) Epoch 9, batch 19600, loss[loss=0.2232, simple_loss=0.295, pruned_loss=0.07569, over 21481.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2821, pruned_loss=0.0649, over 4256325.18 frames. ], batch size: 194, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:43:29,309 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.268e+02 5.045e+02 6.281e+02 9.154e+02 2.396e+03, threshold=1.256e+03, percent-clipped=14.0 2023-06-26 13:43:49,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1581342.0, ans=0.125 2023-06-26 13:44:04,459 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-06-26 13:44:19,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1581462.0, ans=0.1 2023-06-26 13:44:27,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1581522.0, ans=0.125 2023-06-26 13:44:31,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1581522.0, ans=0.0 2023-06-26 13:44:32,201 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.11 vs. limit=15.0 2023-06-26 13:45:13,605 INFO [train.py:996] (2/4) Epoch 9, batch 19650, loss[loss=0.2165, simple_loss=0.29, pruned_loss=0.07151, over 21950.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2873, pruned_loss=0.06862, over 4265987.93 frames. ], batch size: 316, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:46:57,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1581882.0, ans=0.1 2023-06-26 13:47:15,806 INFO [train.py:996] (2/4) Epoch 9, batch 19700, loss[loss=0.184, simple_loss=0.2669, pruned_loss=0.05055, over 21359.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2911, pruned_loss=0.06973, over 4268283.51 frames. 
], batch size: 194, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:47:22,843 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.461e+02 6.188e+02 8.447e+02 1.401e+03 2.428e+03, threshold=1.689e+03, percent-clipped=28.0 2023-06-26 13:47:32,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1582002.0, ans=0.1 2023-06-26 13:48:40,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1582122.0, ans=0.1 2023-06-26 13:49:06,333 INFO [train.py:996] (2/4) Epoch 9, batch 19750, loss[loss=0.2303, simple_loss=0.3283, pruned_loss=0.06616, over 21285.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.3012, pruned_loss=0.07132, over 4269732.26 frames. ], batch size: 176, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:49:12,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1582242.0, ans=0.1 2023-06-26 13:49:55,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1582362.0, ans=0.125 2023-06-26 13:50:05,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1582422.0, ans=0.0 2023-06-26 13:50:22,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1582422.0, ans=0.125 2023-06-26 13:50:55,628 INFO [train.py:996] (2/4) Epoch 9, batch 19800, loss[loss=0.2078, simple_loss=0.2707, pruned_loss=0.07246, over 19975.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3001, pruned_loss=0.07157, over 4267685.91 frames. ], batch size: 702, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:51:02,879 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.312e+02 6.209e+02 8.156e+02 1.271e+03 2.290e+03, threshold=1.631e+03, percent-clipped=8.0 2023-06-26 13:51:35,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1582662.0, ans=0.125 2023-06-26 13:52:40,912 INFO [train.py:996] (2/4) Epoch 9, batch 19850, loss[loss=0.1732, simple_loss=0.2562, pruned_loss=0.0451, over 21600.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2936, pruned_loss=0.06743, over 4270002.09 frames. ], batch size: 263, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:53:00,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1582842.0, ans=0.0 2023-06-26 13:53:52,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1583022.0, ans=0.125 2023-06-26 13:53:52,844 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.72 vs. limit=15.0 2023-06-26 13:54:18,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1583082.0, ans=10.0 2023-06-26 13:54:27,764 INFO [train.py:996] (2/4) Epoch 9, batch 19900, loss[loss=0.1952, simple_loss=0.2713, pruned_loss=0.05958, over 21575.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2931, pruned_loss=0.06457, over 4267391.92 frames. 
], batch size: 247, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:54:30,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1583142.0, ans=0.125 2023-06-26 13:54:34,751 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.163e+02 4.779e+02 6.020e+02 7.987e+02 2.016e+03, threshold=1.204e+03, percent-clipped=5.0 2023-06-26 13:55:47,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1583322.0, ans=0.125 2023-06-26 13:56:07,317 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=12.72 vs. limit=15.0 2023-06-26 13:56:18,956 INFO [train.py:996] (2/4) Epoch 9, batch 19950, loss[loss=0.2149, simple_loss=0.2931, pruned_loss=0.06839, over 21614.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2871, pruned_loss=0.06368, over 4263795.04 frames. ], batch size: 414, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:57:31,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1583622.0, ans=0.0 2023-06-26 13:57:46,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1583622.0, ans=0.125 2023-06-26 13:57:53,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1583682.0, ans=0.0 2023-06-26 13:58:06,798 INFO [train.py:996] (2/4) Epoch 9, batch 20000, loss[loss=0.2223, simple_loss=0.2847, pruned_loss=0.07998, over 20073.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2865, pruned_loss=0.06368, over 4272796.62 frames. ], batch size: 707, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:58:07,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1583742.0, ans=0.1 2023-06-26 13:58:19,157 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.477e+02 4.524e+02 6.104e+02 8.785e+02 2.084e+03, threshold=1.221e+03, percent-clipped=7.0 2023-06-26 13:59:38,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1583982.0, ans=0.125 2023-06-26 13:59:56,239 INFO [train.py:996] (2/4) Epoch 9, batch 20050, loss[loss=0.231, simple_loss=0.2962, pruned_loss=0.08289, over 21567.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2883, pruned_loss=0.06546, over 4277936.83 frames. ], batch size: 548, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 14:00:18,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.out_whiten.whitening_limit, batch_count=1584102.0, ans=8.0 2023-06-26 14:00:49,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1584162.0, ans=0.04949747468305833 2023-06-26 14:01:07,626 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.25 vs. limit=15.0 2023-06-26 14:01:51,601 INFO [train.py:996] (2/4) Epoch 9, batch 20100, loss[loss=0.2418, simple_loss=0.3334, pruned_loss=0.07512, over 21812.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2919, pruned_loss=0.06798, over 4292063.85 frames. 
], batch size: 332, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 14:02:00,513 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.881e+02 4.985e+02 7.812e+02 1.091e+03 2.146e+03, threshold=1.562e+03, percent-clipped=15.0 2023-06-26 14:02:08,038 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=15.0 2023-06-26 14:03:36,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1584582.0, ans=0.125 2023-06-26 14:03:43,021 INFO [train.py:996] (2/4) Epoch 9, batch 20150, loss[loss=0.2826, simple_loss=0.3474, pruned_loss=0.1088, over 21454.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3018, pruned_loss=0.0712, over 4290969.72 frames. ], batch size: 471, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:03:54,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1584642.0, ans=0.1 2023-06-26 14:05:10,275 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.38 vs. limit=10.0 2023-06-26 14:05:10,297 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-26 14:05:53,460 INFO [train.py:996] (2/4) Epoch 9, batch 20200, loss[loss=0.2863, simple_loss=0.3784, pruned_loss=0.09713, over 21701.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3073, pruned_loss=0.07391, over 4291544.77 frames. ], batch size: 441, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:05:59,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1584942.0, ans=0.1 2023-06-26 14:06:02,521 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.033e+02 6.140e+02 1.031e+03 1.445e+03 3.124e+03, threshold=2.061e+03, percent-clipped=23.0 2023-06-26 14:06:24,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1585002.0, ans=0.125 2023-06-26 14:06:48,580 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.38 vs. limit=12.0 2023-06-26 14:07:03,484 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 14:07:07,462 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=22.5 2023-06-26 14:07:43,945 INFO [train.py:996] (2/4) Epoch 9, batch 20250, loss[loss=0.199, simple_loss=0.2904, pruned_loss=0.05381, over 20853.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3075, pruned_loss=0.07277, over 4284076.64 frames. 
], batch size: 607, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:07:49,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1585242.0, ans=0.2 2023-06-26 14:08:55,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1585422.0, ans=0.125 2023-06-26 14:09:18,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1585482.0, ans=0.125 2023-06-26 14:09:26,864 INFO [train.py:996] (2/4) Epoch 9, batch 20300, loss[loss=0.2023, simple_loss=0.289, pruned_loss=0.05783, over 21769.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3042, pruned_loss=0.07015, over 4290129.22 frames. ], batch size: 282, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:09:35,566 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.339e+02 4.853e+02 6.521e+02 1.002e+03 2.689e+03, threshold=1.304e+03, percent-clipped=1.0 2023-06-26 14:09:40,423 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.21 vs. limit=15.0 2023-06-26 14:09:46,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1585602.0, ans=0.2 2023-06-26 14:09:51,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1585602.0, ans=0.2 2023-06-26 14:10:00,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1585662.0, ans=0.2 2023-06-26 14:10:00,562 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.30 vs. limit=15.0 2023-06-26 14:10:00,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1585662.0, ans=10.0 2023-06-26 14:10:12,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1585662.0, ans=0.0 2023-06-26 14:11:15,959 INFO [train.py:996] (2/4) Epoch 9, batch 20350, loss[loss=0.2369, simple_loss=0.3115, pruned_loss=0.08113, over 21880.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3061, pruned_loss=0.07169, over 4283306.33 frames. 
], batch size: 118, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:11:35,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1585902.0, ans=15.0 2023-06-26 14:11:46,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1585902.0, ans=0.125 2023-06-26 14:12:01,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1585962.0, ans=0.2 2023-06-26 14:12:19,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1586022.0, ans=0.125 2023-06-26 14:12:44,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1586082.0, ans=0.125 2023-06-26 14:12:44,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1586082.0, ans=0.0 2023-06-26 14:13:04,281 INFO [train.py:996] (2/4) Epoch 9, batch 20400, loss[loss=0.2371, simple_loss=0.3134, pruned_loss=0.08039, over 21430.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3084, pruned_loss=0.07402, over 4281694.46 frames. ], batch size: 211, lr: 3.25e-03, grad_scale: 32.0 2023-06-26 14:13:13,312 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.291e+02 5.756e+02 8.261e+02 1.227e+03 2.104e+03, threshold=1.652e+03, percent-clipped=22.0 2023-06-26 14:13:43,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1586262.0, ans=0.2 2023-06-26 14:13:55,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1586262.0, ans=0.0 2023-06-26 14:14:07,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1586322.0, ans=0.0 2023-06-26 14:14:20,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1586322.0, ans=0.2 2023-06-26 14:14:31,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1586382.0, ans=0.0 2023-06-26 14:14:51,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1586442.0, ans=0.125 2023-06-26 14:14:52,321 INFO [train.py:996] (2/4) Epoch 9, batch 20450, loss[loss=0.2115, simple_loss=0.2832, pruned_loss=0.06992, over 21944.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3082, pruned_loss=0.07575, over 4272110.88 frames. ], batch size: 316, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:15:02,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1586442.0, ans=0.0 2023-06-26 14:16:15,917 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-26 14:16:33,718 INFO [train.py:996] (2/4) Epoch 9, batch 20500, loss[loss=0.1933, simple_loss=0.2605, pruned_loss=0.06302, over 21617.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3033, pruned_loss=0.07557, over 4269025.07 frames. 
], batch size: 263, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:16:41,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1586742.0, ans=0.0 2023-06-26 14:16:42,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1586742.0, ans=0.125 2023-06-26 14:16:44,014 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.954e+02 5.491e+02 7.367e+02 1.069e+03 2.836e+03, threshold=1.473e+03, percent-clipped=8.0 2023-06-26 14:16:49,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1586802.0, ans=0.0 2023-06-26 14:16:56,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1586802.0, ans=0.1 2023-06-26 14:17:02,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1586802.0, ans=0.2 2023-06-26 14:17:20,273 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.31 vs. limit=15.0 2023-06-26 14:17:41,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1586922.0, ans=0.125 2023-06-26 14:18:21,713 INFO [train.py:996] (2/4) Epoch 9, batch 20550, loss[loss=0.2107, simple_loss=0.285, pruned_loss=0.06819, over 21570.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2956, pruned_loss=0.07351, over 4273241.81 frames. ], batch size: 414, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:18:42,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1587102.0, ans=0.125 2023-06-26 14:19:07,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1587162.0, ans=0.0 2023-06-26 14:19:09,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1587162.0, ans=0.125 2023-06-26 14:19:14,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1587162.0, ans=0.0 2023-06-26 14:20:08,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1587342.0, ans=0.125 2023-06-26 14:20:09,329 INFO [train.py:996] (2/4) Epoch 9, batch 20600, loss[loss=0.2488, simple_loss=0.3138, pruned_loss=0.09192, over 21786.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2986, pruned_loss=0.0723, over 4267211.83 frames. ], batch size: 441, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:20:13,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1587342.0, ans=0.125 2023-06-26 14:20:17,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1587342.0, ans=0.125 2023-06-26 14:20:19,846 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.630e+02 4.988e+02 6.640e+02 9.393e+02 1.385e+03, threshold=1.328e+03, percent-clipped=0.0 2023-06-26 14:21:18,240 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.35 vs. 
limit=22.5 2023-06-26 14:21:41,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1587582.0, ans=0.2 2023-06-26 14:21:56,972 INFO [train.py:996] (2/4) Epoch 9, batch 20650, loss[loss=0.1921, simple_loss=0.2594, pruned_loss=0.0624, over 21688.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2953, pruned_loss=0.07266, over 4274937.83 frames. ], batch size: 298, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:22:09,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1587642.0, ans=0.0 2023-06-26 14:22:32,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1587702.0, ans=0.125 2023-06-26 14:23:47,421 INFO [train.py:996] (2/4) Epoch 9, batch 20700, loss[loss=0.1646, simple_loss=0.2381, pruned_loss=0.04558, over 21768.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2867, pruned_loss=0.06901, over 4265773.14 frames. ], batch size: 124, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:23:58,483 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.316e+02 4.993e+02 7.910e+02 1.068e+03 1.993e+03, threshold=1.582e+03, percent-clipped=12.0 2023-06-26 14:24:16,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1588002.0, ans=0.0 2023-06-26 14:24:35,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1588062.0, ans=0.1 2023-06-26 14:25:38,423 INFO [train.py:996] (2/4) Epoch 9, batch 20750, loss[loss=0.2301, simple_loss=0.329, pruned_loss=0.06562, over 21632.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.289, pruned_loss=0.06784, over 4263178.90 frames. ], batch size: 263, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:25:49,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1588242.0, ans=0.0 2023-06-26 14:26:12,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1588302.0, ans=0.125 2023-06-26 14:26:18,077 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.24 vs. limit=6.0 2023-06-26 14:27:32,216 INFO [train.py:996] (2/4) Epoch 9, batch 20800, loss[loss=0.2122, simple_loss=0.2785, pruned_loss=0.07299, over 21468.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2936, pruned_loss=0.06941, over 4261291.52 frames. ], batch size: 441, lr: 3.25e-03, grad_scale: 32.0 2023-06-26 14:27:42,706 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.373e+02 6.315e+02 8.167e+02 1.529e+03 3.332e+03, threshold=1.633e+03, percent-clipped=23.0 2023-06-26 14:28:12,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1588662.0, ans=0.09899494936611666 2023-06-26 14:28:23,793 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.22 vs. 
limit=12.0 2023-06-26 14:28:49,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1588722.0, ans=0.2 2023-06-26 14:28:59,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1588782.0, ans=0.04949747468305833 2023-06-26 14:29:19,956 INFO [train.py:996] (2/4) Epoch 9, batch 20850, loss[loss=0.177, simple_loss=0.2512, pruned_loss=0.05146, over 21781.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2866, pruned_loss=0.06773, over 4262217.62 frames. ], batch size: 247, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:29:21,193 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.80 vs. limit=5.0 2023-06-26 14:29:27,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1588842.0, ans=0.125 2023-06-26 14:29:41,667 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=22.5 2023-06-26 14:29:57,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1588962.0, ans=0.125 2023-06-26 14:31:08,540 INFO [train.py:996] (2/4) Epoch 9, batch 20900, loss[loss=0.1957, simple_loss=0.2775, pruned_loss=0.05697, over 21585.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2861, pruned_loss=0.06853, over 4269492.59 frames. ], batch size: 263, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:31:09,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1589142.0, ans=0.125 2023-06-26 14:31:20,490 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.215e+02 4.594e+02 6.029e+02 1.010e+03 2.105e+03, threshold=1.206e+03, percent-clipped=4.0 2023-06-26 14:31:34,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1589202.0, ans=0.125 2023-06-26 14:32:29,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1589322.0, ans=0.5 2023-06-26 14:32:37,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1589382.0, ans=0.04949747468305833 2023-06-26 14:32:44,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1589382.0, ans=0.125 2023-06-26 14:32:48,528 INFO [train.py:996] (2/4) Epoch 9, batch 20950, loss[loss=0.1669, simple_loss=0.2402, pruned_loss=0.04683, over 21441.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2827, pruned_loss=0.06555, over 4269631.71 frames. ], batch size: 131, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:32:57,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1589442.0, ans=0.0 2023-06-26 14:33:22,542 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.89 vs. 
limit=10.0 2023-06-26 14:33:25,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1589502.0, ans=0.07 2023-06-26 14:33:39,366 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-26 14:34:20,569 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.27 vs. limit=6.0 2023-06-26 14:34:27,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1589682.0, ans=0.0 2023-06-26 14:34:33,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1589682.0, ans=0.1 2023-06-26 14:34:36,214 INFO [train.py:996] (2/4) Epoch 9, batch 21000, loss[loss=0.2031, simple_loss=0.2786, pruned_loss=0.06386, over 21913.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2812, pruned_loss=0.0657, over 4266654.89 frames. ], batch size: 316, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:34:36,214 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-26 14:34:59,737 INFO [train.py:1028] (2/4) Epoch 9, validation: loss=0.2612, simple_loss=0.3587, pruned_loss=0.0819, over 1796401.00 frames. 2023-06-26 14:34:59,739 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-26 14:35:11,950 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.378e+02 4.892e+02 7.035e+02 1.069e+03 1.759e+03, threshold=1.407e+03, percent-clipped=17.0 2023-06-26 14:36:49,911 INFO [train.py:996] (2/4) Epoch 9, batch 21050, loss[loss=0.2053, simple_loss=0.2757, pruned_loss=0.06745, over 21278.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2796, pruned_loss=0.06606, over 4273899.65 frames. ], batch size: 177, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:37:50,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1590222.0, ans=0.2 2023-06-26 14:38:01,761 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.91 vs. limit=12.0 2023-06-26 14:38:03,596 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.55 vs. limit=22.5 2023-06-26 14:38:23,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1590282.0, ans=0.125 2023-06-26 14:38:36,823 INFO [train.py:996] (2/4) Epoch 9, batch 21100, loss[loss=0.1859, simple_loss=0.2295, pruned_loss=0.07115, over 20755.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2768, pruned_loss=0.06554, over 4275366.59 frames. 
], batch size: 608, lr: 3.25e-03, grad_scale: 8.0 2023-06-26 14:38:50,944 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.682e+02 5.080e+02 7.538e+02 1.007e+03 2.026e+03, threshold=1.508e+03, percent-clipped=9.0 2023-06-26 14:39:30,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1590462.0, ans=0.125 2023-06-26 14:40:05,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=1590582.0, ans=0.1 2023-06-26 14:40:05,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1590582.0, ans=0.04949747468305833 2023-06-26 14:40:24,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1590642.0, ans=0.125 2023-06-26 14:40:25,043 INFO [train.py:996] (2/4) Epoch 9, batch 21150, loss[loss=0.2067, simple_loss=0.2717, pruned_loss=0.07087, over 21826.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2743, pruned_loss=0.06588, over 4268752.42 frames. ], batch size: 107, lr: 3.25e-03, grad_scale: 8.0 2023-06-26 14:40:29,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1590642.0, ans=0.2 2023-06-26 14:40:32,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1590642.0, ans=0.125 2023-06-26 14:41:57,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1590882.0, ans=0.125 2023-06-26 14:42:12,186 INFO [train.py:996] (2/4) Epoch 9, batch 21200, loss[loss=0.1829, simple_loss=0.248, pruned_loss=0.0589, over 21224.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2706, pruned_loss=0.06478, over 4253409.85 frames. ], batch size: 176, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:42:16,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1590942.0, ans=0.125 2023-06-26 14:42:26,034 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.202e+02 4.962e+02 6.952e+02 8.758e+02 1.783e+03, threshold=1.390e+03, percent-clipped=2.0 2023-06-26 14:42:32,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1591002.0, ans=0.1 2023-06-26 14:43:07,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1591062.0, ans=0.125 2023-06-26 14:43:25,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1591122.0, ans=0.04949747468305833 2023-06-26 14:43:56,775 INFO [train.py:996] (2/4) Epoch 9, batch 21250, loss[loss=0.2404, simple_loss=0.3257, pruned_loss=0.07756, over 21606.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2683, pruned_loss=0.06438, over 4257276.05 frames. ], batch size: 389, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:45:20,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1591482.0, ans=0.125 2023-06-26 14:45:33,937 INFO [train.py:996] (2/4) Epoch 9, batch 21300, loss[loss=0.2104, simple_loss=0.2967, pruned_loss=0.06207, over 21737.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2748, pruned_loss=0.06619, over 4271532.70 frames. 
], batch size: 298, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:45:41,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1591542.0, ans=0.125 2023-06-26 14:45:52,768 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.243e+02 5.570e+02 8.003e+02 1.129e+03 3.066e+03, threshold=1.601e+03, percent-clipped=15.0 2023-06-26 14:46:25,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1591662.0, ans=0.125 2023-06-26 14:46:42,369 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.99 vs. limit=15.0 2023-06-26 14:46:48,046 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-26 14:47:23,324 INFO [train.py:996] (2/4) Epoch 9, batch 21350, loss[loss=0.2043, simple_loss=0.2972, pruned_loss=0.05573, over 21355.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2808, pruned_loss=0.06765, over 4276172.36 frames. ], batch size: 548, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:47:23,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1591842.0, ans=0.125 2023-06-26 14:47:43,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1591902.0, ans=0.125 2023-06-26 14:48:23,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1591962.0, ans=0.125 2023-06-26 14:48:31,231 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.62 vs. limit=12.0 2023-06-26 14:49:12,013 INFO [train.py:996] (2/4) Epoch 9, batch 21400, loss[loss=0.2437, simple_loss=0.3295, pruned_loss=0.07902, over 21466.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2828, pruned_loss=0.06699, over 4276234.23 frames. ], batch size: 131, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:49:25,998 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.278e+02 4.706e+02 6.583e+02 9.880e+02 2.077e+03, threshold=1.317e+03, percent-clipped=4.0 2023-06-26 14:49:45,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1592202.0, ans=0.2 2023-06-26 14:50:14,920 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-06-26 14:50:39,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1592322.0, ans=0.0 2023-06-26 14:50:44,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1592382.0, ans=0.0 2023-06-26 14:51:00,487 INFO [train.py:996] (2/4) Epoch 9, batch 21450, loss[loss=0.2129, simple_loss=0.2821, pruned_loss=0.07184, over 21303.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2871, pruned_loss=0.06894, over 4280021.73 frames. 
], batch size: 176, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:51:06,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1592442.0, ans=0.0 2023-06-26 14:51:32,963 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.04 vs. limit=12.0 2023-06-26 14:51:48,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1592562.0, ans=0.125 2023-06-26 14:52:15,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1592622.0, ans=0.125 2023-06-26 14:52:16,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1592622.0, ans=0.125 2023-06-26 14:52:43,810 INFO [train.py:996] (2/4) Epoch 9, batch 21500, loss[loss=0.2008, simple_loss=0.2658, pruned_loss=0.06789, over 21712.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2862, pruned_loss=0.07002, over 4279584.73 frames. ], batch size: 333, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:53:03,332 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.822e+02 5.893e+02 8.169e+02 1.189e+03 2.218e+03, threshold=1.634e+03, percent-clipped=19.0 2023-06-26 14:53:38,096 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=22.5 2023-06-26 14:54:08,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1592922.0, ans=0.2 2023-06-26 14:54:11,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1592922.0, ans=0.125 2023-06-26 14:54:32,482 INFO [train.py:996] (2/4) Epoch 9, batch 21550, loss[loss=0.1768, simple_loss=0.2363, pruned_loss=0.05866, over 21212.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2797, pruned_loss=0.06727, over 4268069.82 frames. ], batch size: 548, lr: 3.25e-03, grad_scale: 8.0 2023-06-26 14:54:53,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1593042.0, ans=0.035 2023-06-26 14:55:38,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1593162.0, ans=0.125 2023-06-26 14:56:23,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1593282.0, ans=0.125 2023-06-26 14:56:26,272 INFO [train.py:996] (2/4) Epoch 9, batch 21600, loss[loss=0.1737, simple_loss=0.2313, pruned_loss=0.05807, over 21220.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2745, pruned_loss=0.06513, over 4263323.86 frames. 
], batch size: 549, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:56:27,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1593342.0, ans=0.1 2023-06-26 14:56:34,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1593342.0, ans=0.07 2023-06-26 14:56:53,156 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.217e+02 4.926e+02 7.373e+02 9.794e+02 2.336e+03, threshold=1.475e+03, percent-clipped=12.0 2023-06-26 14:57:02,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1593402.0, ans=0.1 2023-06-26 14:57:17,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1593462.0, ans=0.125 2023-06-26 14:57:22,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1593462.0, ans=0.2 2023-06-26 14:57:55,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1593522.0, ans=0.0 2023-06-26 14:58:15,067 INFO [train.py:996] (2/4) Epoch 9, batch 21650, loss[loss=0.2, simple_loss=0.2983, pruned_loss=0.05085, over 21819.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2783, pruned_loss=0.06296, over 4258706.99 frames. ], batch size: 317, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:58:34,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1593642.0, ans=0.125 2023-06-26 14:58:47,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1593702.0, ans=0.1 2023-06-26 14:58:47,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1593702.0, ans=0.1 2023-06-26 14:59:07,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1593762.0, ans=0.125 2023-06-26 14:59:14,633 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-26 14:59:16,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1593762.0, ans=22.5 2023-06-26 14:59:32,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=1593822.0, ans=0.1 2023-06-26 14:59:33,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1593822.0, ans=0.04949747468305833 2023-06-26 14:59:45,305 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.76 vs. limit=15.0 2023-06-26 15:00:01,559 INFO [train.py:996] (2/4) Epoch 9, batch 21700, loss[loss=0.2053, simple_loss=0.2638, pruned_loss=0.07339, over 21324.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2786, pruned_loss=0.06196, over 4261857.16 frames. 
], batch size: 144, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 15:00:06,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1593942.0, ans=0.2 2023-06-26 15:00:22,143 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.421e+02 4.737e+02 7.563e+02 1.159e+03 3.422e+03, threshold=1.513e+03, percent-clipped=14.0 2023-06-26 15:01:26,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1594122.0, ans=0.125 2023-06-26 15:01:36,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1594182.0, ans=0.0 2023-06-26 15:01:47,601 INFO [train.py:996] (2/4) Epoch 9, batch 21750, loss[loss=0.1894, simple_loss=0.2606, pruned_loss=0.05912, over 21851.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2749, pruned_loss=0.06203, over 4245151.38 frames. ], batch size: 107, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:02:28,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1594302.0, ans=0.125 2023-06-26 15:02:41,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1594362.0, ans=0.125 2023-06-26 15:03:10,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1594422.0, ans=0.2 2023-06-26 15:03:37,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1594542.0, ans=15.0 2023-06-26 15:03:38,355 INFO [train.py:996] (2/4) Epoch 9, batch 21800, loss[loss=0.1895, simple_loss=0.255, pruned_loss=0.06198, over 21834.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2734, pruned_loss=0.06297, over 4241688.17 frames. ], batch size: 318, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:04:04,211 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.457e+02 4.843e+02 6.619e+02 9.442e+02 2.103e+03, threshold=1.324e+03, percent-clipped=2.0 2023-06-26 15:04:14,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1594602.0, ans=0.0 2023-06-26 15:04:31,743 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.63 vs. limit=22.5 2023-06-26 15:04:54,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1594722.0, ans=10.0 2023-06-26 15:05:25,820 INFO [train.py:996] (2/4) Epoch 9, batch 21850, loss[loss=0.1858, simple_loss=0.2474, pruned_loss=0.06206, over 19995.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2783, pruned_loss=0.06353, over 4250228.00 frames. ], batch size: 702, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:06:29,694 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0 2023-06-26 15:07:12,449 INFO [train.py:996] (2/4) Epoch 9, batch 21900, loss[loss=0.2268, simple_loss=0.2822, pruned_loss=0.08577, over 21416.00 frames. ], tot_loss[loss=0.205, simple_loss=0.28, pruned_loss=0.06503, over 4261782.91 frames. 
], batch size: 471, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:07:23,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1595142.0, ans=0.0 2023-06-26 15:07:38,242 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.466e+02 4.571e+02 6.004e+02 8.081e+02 1.811e+03, threshold=1.201e+03, percent-clipped=9.0 2023-06-26 15:07:42,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1595202.0, ans=0.07 2023-06-26 15:07:49,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1595202.0, ans=0.125 2023-06-26 15:07:49,696 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.30 vs. limit=15.0 2023-06-26 15:08:12,468 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.43 vs. limit=15.0 2023-06-26 15:08:58,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1595442.0, ans=0.0 2023-06-26 15:09:04,625 INFO [train.py:996] (2/4) Epoch 9, batch 21950, loss[loss=0.1786, simple_loss=0.2608, pruned_loss=0.04822, over 21661.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2751, pruned_loss=0.06446, over 4270452.19 frames. ], batch size: 415, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:09:05,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1595442.0, ans=0.04949747468305833 2023-06-26 15:09:10,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1595442.0, ans=0.125 2023-06-26 15:09:47,833 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.66 vs. limit=15.0 2023-06-26 15:10:19,299 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.17 vs. limit=15.0 2023-06-26 15:10:53,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1595742.0, ans=15.0 2023-06-26 15:10:54,254 INFO [train.py:996] (2/4) Epoch 9, batch 22000, loss[loss=0.186, simple_loss=0.2526, pruned_loss=0.0597, over 21588.00 frames. ], tot_loss[loss=0.1961, simple_loss=0.2691, pruned_loss=0.06155, over 4271925.83 frames. ], batch size: 263, lr: 3.24e-03, grad_scale: 32.0 2023-06-26 15:11:15,756 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.199e+02 4.453e+02 7.165e+02 9.999e+02 1.931e+03, threshold=1.433e+03, percent-clipped=13.0 2023-06-26 15:11:50,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1595862.0, ans=0.125 2023-06-26 15:12:22,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1595982.0, ans=0.125 2023-06-26 15:12:49,913 INFO [train.py:996] (2/4) Epoch 9, batch 22050, loss[loss=0.2414, simple_loss=0.3314, pruned_loss=0.07569, over 21642.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2734, pruned_loss=0.06299, over 4263995.93 frames. 
], batch size: 389, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:13:05,727 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.60 vs. limit=15.0 2023-06-26 15:13:07,627 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=22.5 2023-06-26 15:13:33,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1596162.0, ans=0.0 2023-06-26 15:14:23,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1596282.0, ans=0.125 2023-06-26 15:14:24,564 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.45 vs. limit=15.0 2023-06-26 15:14:25,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1596282.0, ans=0.125 2023-06-26 15:14:38,918 INFO [train.py:996] (2/4) Epoch 9, batch 22100, loss[loss=0.1791, simple_loss=0.2491, pruned_loss=0.05459, over 16959.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2844, pruned_loss=0.06753, over 4247504.54 frames. ], batch size: 63, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:14:56,633 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.930e+02 6.282e+02 9.612e+02 1.455e+03 3.538e+03, threshold=1.922e+03, percent-clipped=29.0 2023-06-26 15:15:53,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1596522.0, ans=0.0 2023-06-26 15:15:55,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1596522.0, ans=0.5 2023-06-26 15:16:26,359 INFO [train.py:996] (2/4) Epoch 9, batch 22150, loss[loss=0.2453, simple_loss=0.3081, pruned_loss=0.09125, over 21705.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2879, pruned_loss=0.06953, over 4259823.62 frames. ], batch size: 473, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:16:32,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1596642.0, ans=0.0 2023-06-26 15:16:53,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1596702.0, ans=0.1 2023-06-26 15:16:53,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1596702.0, ans=0.125 2023-06-26 15:17:13,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1596762.0, ans=0.0 2023-06-26 15:17:34,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1596822.0, ans=0.2 2023-06-26 15:17:34,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1596822.0, ans=0.0 2023-06-26 15:17:38,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1596822.0, ans=0.0 2023-06-26 15:18:14,961 INFO [train.py:996] (2/4) Epoch 9, batch 22200, loss[loss=0.2104, simple_loss=0.2879, pruned_loss=0.06647, over 21288.00 frames. 
], tot_loss[loss=0.2146, simple_loss=0.289, pruned_loss=0.07011, over 4275962.38 frames. ], batch size: 159, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:18:32,778 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.802e+02 5.060e+02 7.082e+02 1.053e+03 2.242e+03, threshold=1.416e+03, percent-clipped=3.0 2023-06-26 15:19:02,299 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.69 vs. limit=22.5 2023-06-26 15:19:42,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1597182.0, ans=0.1 2023-06-26 15:19:54,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1597182.0, ans=0.125 2023-06-26 15:19:54,668 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-26 15:20:04,052 INFO [train.py:996] (2/4) Epoch 9, batch 22250, loss[loss=0.2386, simple_loss=0.3176, pruned_loss=0.07979, over 21770.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2983, pruned_loss=0.07194, over 4271330.64 frames. ], batch size: 247, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:20:54,715 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=22.5 2023-06-26 15:21:11,231 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=22.5 2023-06-26 15:21:24,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1597482.0, ans=0.1 2023-06-26 15:21:26,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1597482.0, ans=0.1 2023-06-26 15:21:46,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1597482.0, ans=0.0 2023-06-26 15:21:51,034 INFO [train.py:996] (2/4) Epoch 9, batch 22300, loss[loss=0.2164, simple_loss=0.2839, pruned_loss=0.07443, over 21870.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3, pruned_loss=0.0739, over 4280310.68 frames. ], batch size: 332, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:22:08,320 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.561e+02 5.384e+02 7.516e+02 1.079e+03 3.010e+03, threshold=1.503e+03, percent-clipped=16.0 2023-06-26 15:22:56,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1597722.0, ans=0.125 2023-06-26 15:23:04,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1597722.0, ans=0.125 2023-06-26 15:23:21,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1597782.0, ans=0.2 2023-06-26 15:23:25,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1597782.0, ans=0.125 2023-06-26 15:23:33,872 INFO [train.py:996] (2/4) Epoch 9, batch 22350, loss[loss=0.2171, simple_loss=0.3099, pruned_loss=0.06214, over 17239.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2982, pruned_loss=0.07462, over 4279094.41 frames. 
], batch size: 60, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:23:36,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1597842.0, ans=0.1 2023-06-26 15:24:41,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1598022.0, ans=0.125 2023-06-26 15:25:06,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1598082.0, ans=0.2 2023-06-26 15:25:13,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1598082.0, ans=0.125 2023-06-26 15:25:18,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1598082.0, ans=0.125 2023-06-26 15:25:21,639 INFO [train.py:996] (2/4) Epoch 9, batch 22400, loss[loss=0.1943, simple_loss=0.2619, pruned_loss=0.0634, over 21307.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2953, pruned_loss=0.07238, over 4274673.87 frames. ], batch size: 608, lr: 3.24e-03, grad_scale: 32.0 2023-06-26 15:25:24,783 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.10 vs. limit=15.0 2023-06-26 15:25:49,397 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.723e+02 5.104e+02 6.690e+02 9.796e+02 2.008e+03, threshold=1.338e+03, percent-clipped=2.0 2023-06-26 15:25:49,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1598202.0, ans=0.1 2023-06-26 15:26:41,986 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.03 vs. limit=10.0 2023-06-26 15:26:49,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1598382.0, ans=0.2 2023-06-26 15:27:03,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1598382.0, ans=0.1 2023-06-26 15:27:14,436 INFO [train.py:996] (2/4) Epoch 9, batch 22450, loss[loss=0.2379, simple_loss=0.2721, pruned_loss=0.1019, over 21317.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2897, pruned_loss=0.07154, over 4277716.84 frames. 
], batch size: 507, lr: 3.24e-03, grad_scale: 32.0 2023-06-26 15:27:18,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1598442.0, ans=0.125 2023-06-26 15:27:59,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1598562.0, ans=0.125 2023-06-26 15:28:00,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1598562.0, ans=0.1 2023-06-26 15:28:29,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1598622.0, ans=0.125 2023-06-26 15:28:34,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1598622.0, ans=0.2 2023-06-26 15:28:51,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1598682.0, ans=0.125 2023-06-26 15:29:00,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1598682.0, ans=0.125 2023-06-26 15:29:02,885 INFO [train.py:996] (2/4) Epoch 9, batch 22500, loss[loss=0.2728, simple_loss=0.3616, pruned_loss=0.09197, over 21564.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.285, pruned_loss=0.07058, over 4265832.40 frames. ], batch size: 441, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:29:26,970 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.381e+02 5.166e+02 7.858e+02 1.138e+03 3.264e+03, threshold=1.572e+03, percent-clipped=12.0 2023-06-26 15:30:18,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1598922.0, ans=0.125 2023-06-26 15:30:57,538 INFO [train.py:996] (2/4) Epoch 9, batch 22550, loss[loss=0.2145, simple_loss=0.3279, pruned_loss=0.05061, over 20719.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2892, pruned_loss=0.07053, over 4274243.52 frames. ], batch size: 607, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:31:22,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1599102.0, ans=0.125 2023-06-26 15:31:39,413 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.45 vs. limit=22.5 2023-06-26 15:32:32,866 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.70 vs. limit=12.0 2023-06-26 15:32:33,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1599282.0, ans=0.1 2023-06-26 15:32:49,167 INFO [train.py:996] (2/4) Epoch 9, batch 22600, loss[loss=0.255, simple_loss=0.3449, pruned_loss=0.08254, over 21654.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2936, pruned_loss=0.07078, over 4271418.11 frames. 
], batch size: 441, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:33:08,790 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.130e+02 6.886e+02 1.082e+03 1.570e+03 3.521e+03, threshold=2.164e+03, percent-clipped=24.0 2023-06-26 15:33:23,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1599402.0, ans=0.0 2023-06-26 15:34:05,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1599522.0, ans=0.125 2023-06-26 15:34:37,827 INFO [train.py:996] (2/4) Epoch 9, batch 22650, loss[loss=0.2209, simple_loss=0.2722, pruned_loss=0.08483, over 21527.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2903, pruned_loss=0.0703, over 4275154.33 frames. ], batch size: 441, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:36:24,822 INFO [train.py:996] (2/4) Epoch 9, batch 22700, loss[loss=0.2031, simple_loss=0.267, pruned_loss=0.06959, over 21416.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2838, pruned_loss=0.06922, over 4272840.38 frames. ], batch size: 389, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:36:38,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1599942.0, ans=0.0 2023-06-26 15:36:44,331 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.472e+02 5.506e+02 7.412e+02 1.059e+03 2.032e+03, threshold=1.482e+03, percent-clipped=0.0 2023-06-26 15:36:58,507 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.95 vs. limit=15.0 2023-06-26 15:37:16,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1600062.0, ans=0.1 2023-06-26 15:37:21,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1600062.0, ans=0.1 2023-06-26 15:37:50,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1600122.0, ans=0.125 2023-06-26 15:37:52,713 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=12.0 2023-06-26 15:37:58,026 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.17 vs. limit=10.0 2023-06-26 15:38:12,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1600242.0, ans=0.125 2023-06-26 15:38:13,904 INFO [train.py:996] (2/4) Epoch 9, batch 22750, loss[loss=0.2082, simple_loss=0.2768, pruned_loss=0.06978, over 21145.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.284, pruned_loss=0.07075, over 4271148.79 frames. 
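The ScheduledFloat entries record hyperparameters (dropout probabilities, skip rates, bypass scale limits) whose current value is looked up as a function of batch_count. A plausible reading is a piecewise-linear schedule over the batch counter; the sketch below shows that shape only. PiecewiseLinearFloat and its breakpoints are invented for illustration and are not the scaling.py implementation.

import bisect


class PiecewiseLinearFloat:
    """A float that depends on the global batch count.

    Values are linearly interpolated between (batch_count, value)
    breakpoints and held constant outside them. The breakpoints used
    here are made up for illustration.
    """

    def __init__(self, *points):
        self.xs = [x for x, _ in points]
        self.ys = [y for _, y in points]

    def __call__(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        t = (batch_count - x0) / (x1 - x0)
        return y0 + t * (y1 - y0)


# Example: a dropout probability that decays early in training and is
# then held at 0.1, matching the 'ans=0.1' seen for out_proj.dropout_p.
dropout_p = PiecewiseLinearFloat((0.0, 0.3), (20000.0, 0.1))
print(dropout_p(1600062.0))  # -> 0.1 at large batch counts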
], batch size: 143, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:39:02,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1600362.0, ans=0.025 2023-06-26 15:39:21,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1600422.0, ans=0.0 2023-06-26 15:39:28,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1600422.0, ans=0.0 2023-06-26 15:40:01,407 INFO [train.py:996] (2/4) Epoch 9, batch 22800, loss[loss=0.226, simple_loss=0.2886, pruned_loss=0.08169, over 21654.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2887, pruned_loss=0.0727, over 4273436.41 frames. ], batch size: 263, lr: 3.24e-03, grad_scale: 32.0 2023-06-26 15:40:07,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1600542.0, ans=0.125 2023-06-26 15:40:21,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1600542.0, ans=0.0 2023-06-26 15:40:28,038 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.571e+02 5.339e+02 7.756e+02 1.140e+03 2.355e+03, threshold=1.551e+03, percent-clipped=14.0 2023-06-26 15:40:42,437 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.51 vs. limit=15.0 2023-06-26 15:40:50,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1600662.0, ans=0.125 2023-06-26 15:41:49,496 INFO [train.py:996] (2/4) Epoch 9, batch 22850, loss[loss=0.2063, simple_loss=0.2724, pruned_loss=0.07013, over 15091.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2848, pruned_loss=0.07186, over 4269808.52 frames. ], batch size: 60, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:42:54,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1601022.0, ans=0.125 2023-06-26 15:43:21,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=1601082.0, ans=22.5 2023-06-26 15:43:22,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1601082.0, ans=0.125 2023-06-26 15:43:37,595 INFO [train.py:996] (2/4) Epoch 9, batch 22900, loss[loss=0.2024, simple_loss=0.297, pruned_loss=0.05392, over 21391.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2845, pruned_loss=0.07094, over 4275748.34 frames. ], batch size: 211, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:44:04,405 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.622e+02 6.448e+02 8.997e+02 1.321e+03 2.993e+03, threshold=1.799e+03, percent-clipped=19.0 2023-06-26 15:44:26,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1601262.0, ans=0.125 2023-06-26 15:45:16,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1601382.0, ans=0.0 2023-06-26 15:45:19,275 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. 
limit=15.0 2023-06-26 15:45:22,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1601382.0, ans=0.125 2023-06-26 15:45:25,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1601382.0, ans=0.1 2023-06-26 15:45:28,330 INFO [train.py:996] (2/4) Epoch 9, batch 22950, loss[loss=0.2292, simple_loss=0.3528, pruned_loss=0.05276, over 21795.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2964, pruned_loss=0.06933, over 4270602.86 frames. ], batch size: 351, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:45:44,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1601442.0, ans=0.0 2023-06-26 15:46:06,247 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.40 vs. limit=22.5 2023-06-26 15:46:14,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1601562.0, ans=0.2 2023-06-26 15:46:49,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1601622.0, ans=0.125 2023-06-26 15:47:10,767 INFO [train.py:996] (2/4) Epoch 9, batch 23000, loss[loss=0.2165, simple_loss=0.2915, pruned_loss=0.07076, over 21905.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2954, pruned_loss=0.06766, over 4274258.59 frames. ], batch size: 371, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:47:35,331 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.06 vs. limit=15.0 2023-06-26 15:47:42,633 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.591e+02 4.526e+02 6.178e+02 9.113e+02 2.510e+03, threshold=1.236e+03, percent-clipped=4.0 2023-06-26 15:48:06,295 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.64 vs. limit=10.0 2023-06-26 15:48:06,331 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-26 15:48:37,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1601922.0, ans=0.035 2023-06-26 15:48:43,214 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=22.5 2023-06-26 15:49:11,848 INFO [train.py:996] (2/4) Epoch 9, batch 23050, loss[loss=0.2279, simple_loss=0.3044, pruned_loss=0.07571, over 21478.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2959, pruned_loss=0.06926, over 4268517.50 frames. 
], batch size: 548, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:49:12,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1602042.0, ans=0.125 2023-06-26 15:49:54,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1602102.0, ans=0.125 2023-06-26 15:50:01,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1602162.0, ans=0.0 2023-06-26 15:50:19,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1602222.0, ans=0.0 2023-06-26 15:50:55,769 INFO [train.py:996] (2/4) Epoch 9, batch 23100, loss[loss=0.2036, simple_loss=0.2617, pruned_loss=0.07271, over 21156.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2926, pruned_loss=0.06975, over 4274976.19 frames. ], batch size: 159, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:51:10,354 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 15:51:22,039 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.578e+02 4.836e+02 6.120e+02 9.547e+02 2.287e+03, threshold=1.224e+03, percent-clipped=14.0 2023-06-26 15:51:43,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1602462.0, ans=0.125 2023-06-26 15:51:47,822 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.08 vs. limit=15.0 2023-06-26 15:51:54,920 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.45 vs. limit=10.0 2023-06-26 15:52:13,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1602522.0, ans=0.125 2023-06-26 15:52:15,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1602582.0, ans=0.125 2023-06-26 15:52:43,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1602642.0, ans=0.125 2023-06-26 15:52:44,519 INFO [train.py:996] (2/4) Epoch 9, batch 23150, loss[loss=0.2075, simple_loss=0.2809, pruned_loss=0.067, over 21803.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2866, pruned_loss=0.06902, over 4263688.77 frames. ], batch size: 298, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:52:54,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1602642.0, ans=0.2 2023-06-26 15:53:11,985 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-26 15:53:57,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1602822.0, ans=0.1 2023-06-26 15:54:06,190 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.49 vs. 
limit=15.0 2023-06-26 15:54:14,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1602882.0, ans=0.0 2023-06-26 15:54:25,643 INFO [train.py:996] (2/4) Epoch 9, batch 23200, loss[loss=0.1948, simple_loss=0.2586, pruned_loss=0.06554, over 21689.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2872, pruned_loss=0.07019, over 4272032.57 frames. ], batch size: 230, lr: 3.24e-03, grad_scale: 32.0 2023-06-26 15:54:43,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1602942.0, ans=0.125 2023-06-26 15:54:57,774 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.454e+02 5.096e+02 6.731e+02 1.055e+03 2.311e+03, threshold=1.346e+03, percent-clipped=14.0 2023-06-26 15:55:23,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1603062.0, ans=0.0 2023-06-26 15:55:45,317 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.44 vs. limit=12.0 2023-06-26 15:56:14,186 INFO [train.py:996] (2/4) Epoch 9, batch 23250, loss[loss=0.2716, simple_loss=0.3146, pruned_loss=0.1143, over 21796.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.287, pruned_loss=0.07119, over 4284082.28 frames. ], batch size: 508, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:57:52,484 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=22.5 2023-06-26 15:58:08,882 INFO [train.py:996] (2/4) Epoch 9, batch 23300, loss[loss=0.3574, simple_loss=0.4382, pruned_loss=0.1383, over 21452.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2942, pruned_loss=0.07265, over 4290126.79 frames. ], batch size: 507, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:58:37,847 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.694e+02 6.000e+02 9.033e+02 1.405e+03 3.617e+03, threshold=1.807e+03, percent-clipped=26.0 2023-06-26 15:59:06,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1603662.0, ans=0.125 2023-06-26 15:59:15,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1603722.0, ans=10.0 2023-06-26 16:00:04,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1603842.0, ans=0.125 2023-06-26 16:00:05,628 INFO [train.py:996] (2/4) Epoch 9, batch 23350, loss[loss=0.2008, simple_loss=0.2872, pruned_loss=0.05721, over 21761.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.298, pruned_loss=0.07115, over 4281683.96 frames. ], batch size: 332, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 16:00:43,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1603902.0, ans=0.1 2023-06-26 16:01:15,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1604022.0, ans=0.125 2023-06-26 16:01:53,515 INFO [train.py:996] (2/4) Epoch 9, batch 23400, loss[loss=0.2179, simple_loss=0.2842, pruned_loss=0.07584, over 21462.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2918, pruned_loss=0.06772, over 4276294.34 frames. 
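Each loss[...] entry reports per-batch values and tot_loss[...] a running aggregate. The printed numbers are consistent with loss = 0.5 * simple_loss + pruned_loss (e.g. 0.5 * 0.3146 + 0.1143 = 0.2716 for batch 23250), and the tot_loss frame counts hover around 4.26-4.29M rather than growing, so the aggregate behaves like a frame-weighted average over a window of recent batches. The sketch below mimics that shape only: the 0.5 weight is inferred from the printed numbers, and the fixed-length window is a stand-in for whatever decay the recipe actually uses.

from collections import deque


class LossTracker:
    """Frame-weighted loss aggregation over a window of recent batches.

    Mimics the shape of the log lines: per-batch values plus a running
    'tot_loss ... over N frames' average. Both the 0.5 weight on
    simple_loss and the window length are assumptions, not the recipe's
    actual bookkeeping.
    """

    def __init__(self, simple_loss_weight: float = 0.5, window: int = 200):
        self.simple_loss_weight = simple_loss_weight
        # Each element: (loss*frames, simple*frames, pruned*frames, frames)
        self.batches = deque(maxlen=window)

    def update(self, simple_loss: float, pruned_loss: float, frames: float) -> dict:
        loss = self.simple_loss_weight * simple_loss + pruned_loss
        self.batches.append(
            (loss * frames, simple_loss * frames, pruned_loss * frames, frames)
        )
        tot_frames = sum(b[3] for b in self.batches)
        tot = {
            "loss": sum(b[0] for b in self.batches) / tot_frames,
            "simple_loss": sum(b[1] for b in self.batches) / tot_frames,
            "pruned_loss": sum(b[2] for b in self.batches) / tot_frames,
        }
        return {"batch_loss": loss, "tot_loss": tot, "tot_frames": tot_frames}


tracker = LossTracker()
# Values in the style of the entries above (illustrative only).
print(tracker.update(simple_loss=0.3146, pruned_loss=0.1143, frames=21796.0))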
], batch size: 194, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:02:21,756 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.225e+02 5.479e+02 7.119e+02 1.024e+03 2.077e+03, threshold=1.424e+03, percent-clipped=2.0 2023-06-26 16:02:25,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1604202.0, ans=0.1 2023-06-26 16:03:04,736 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=15.0 2023-06-26 16:03:47,226 INFO [train.py:996] (2/4) Epoch 9, batch 23450, loss[loss=0.2134, simple_loss=0.2895, pruned_loss=0.06866, over 21967.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2927, pruned_loss=0.0696, over 4269937.64 frames. ], batch size: 316, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:04:10,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1604502.0, ans=0.125 2023-06-26 16:04:53,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1604622.0, ans=0.125 2023-06-26 16:05:01,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1604622.0, ans=0.125 2023-06-26 16:05:28,873 INFO [train.py:996] (2/4) Epoch 9, batch 23500, loss[loss=0.2079, simple_loss=0.2733, pruned_loss=0.07126, over 21457.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.292, pruned_loss=0.07074, over 4271153.89 frames. ], batch size: 194, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:05:31,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1604742.0, ans=0.125 2023-06-26 16:05:56,207 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.762e+02 6.548e+02 9.049e+02 1.310e+03 3.325e+03, threshold=1.810e+03, percent-clipped=21.0 2023-06-26 16:06:38,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1604922.0, ans=0.1 2023-06-26 16:07:15,845 INFO [train.py:996] (2/4) Epoch 9, batch 23550, loss[loss=0.1868, simple_loss=0.2514, pruned_loss=0.06107, over 21686.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2877, pruned_loss=0.07041, over 4262754.69 frames. 
], batch size: 316, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:07:16,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1605042.0, ans=0.125 2023-06-26 16:07:33,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1605042.0, ans=0.0 2023-06-26 16:07:33,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1605042.0, ans=0.1 2023-06-26 16:08:10,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1605162.0, ans=0.1 2023-06-26 16:08:20,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1605162.0, ans=0.0 2023-06-26 16:08:24,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1605222.0, ans=15.0 2023-06-26 16:09:04,162 INFO [train.py:996] (2/4) Epoch 9, batch 23600, loss[loss=0.2263, simple_loss=0.3055, pruned_loss=0.07357, over 21873.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2886, pruned_loss=0.07074, over 4260542.91 frames. ], batch size: 316, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:09:18,164 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.08 vs. limit=15.0 2023-06-26 16:09:32,651 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.846e+02 5.348e+02 7.374e+02 1.134e+03 2.536e+03, threshold=1.475e+03, percent-clipped=3.0 2023-06-26 16:09:36,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1605402.0, ans=0.125 2023-06-26 16:09:36,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1605402.0, ans=0.0 2023-06-26 16:10:18,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1605522.0, ans=0.1 2023-06-26 16:10:25,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1605522.0, ans=0.125 2023-06-26 16:10:55,366 INFO [train.py:996] (2/4) Epoch 9, batch 23650, loss[loss=0.2041, simple_loss=0.2801, pruned_loss=0.06405, over 21328.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2882, pruned_loss=0.06929, over 4262273.43 frames. ], batch size: 159, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:11:15,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=1605702.0, ans=0.1 2023-06-26 16:11:27,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1605702.0, ans=0.0 2023-06-26 16:12:05,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1605822.0, ans=0.0 2023-06-26 16:12:37,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1605882.0, ans=0.04949747468305833 2023-06-26 16:12:43,732 INFO [train.py:996] (2/4) Epoch 9, batch 23700, loss[loss=0.2362, simple_loss=0.3143, pruned_loss=0.07907, over 21741.00 frames. 
], tot_loss[loss=0.2147, simple_loss=0.2911, pruned_loss=0.06913, over 4265724.79 frames. ], batch size: 441, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:12:46,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1605942.0, ans=0.125 2023-06-26 16:13:18,789 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.291e+02 4.703e+02 6.208e+02 8.925e+02 2.253e+03, threshold=1.242e+03, percent-clipped=5.0 2023-06-26 16:13:47,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1606062.0, ans=0.0 2023-06-26 16:13:50,607 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.27 vs. limit=22.5 2023-06-26 16:14:10,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1606122.0, ans=0.125 2023-06-26 16:14:11,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1606122.0, ans=0.025 2023-06-26 16:14:27,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1606182.0, ans=0.0 2023-06-26 16:14:33,506 INFO [train.py:996] (2/4) Epoch 9, batch 23750, loss[loss=0.2083, simple_loss=0.3081, pruned_loss=0.05426, over 21737.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2947, pruned_loss=0.06976, over 4268465.71 frames. ], batch size: 351, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:14:53,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1606242.0, ans=0.2 2023-06-26 16:16:00,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1606422.0, ans=0.05 2023-06-26 16:16:05,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1606482.0, ans=0.0 2023-06-26 16:16:14,937 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.51 vs. limit=15.0 2023-06-26 16:16:23,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1606482.0, ans=0.0 2023-06-26 16:16:27,407 INFO [train.py:996] (2/4) Epoch 9, batch 23800, loss[loss=0.2442, simple_loss=0.3368, pruned_loss=0.07582, over 21805.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2935, pruned_loss=0.06778, over 4268224.60 frames. ], batch size: 282, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:17:04,099 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.330e+02 5.235e+02 7.849e+02 1.092e+03 2.188e+03, threshold=1.570e+03, percent-clipped=19.0 2023-06-26 16:18:13,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1606782.0, ans=0.125 2023-06-26 16:18:29,063 INFO [train.py:996] (2/4) Epoch 9, batch 23850, loss[loss=0.2513, simple_loss=0.3628, pruned_loss=0.06994, over 19727.00 frames. ], tot_loss[loss=0.22, simple_loss=0.3012, pruned_loss=0.06935, over 4269818.23 frames. 
], batch size: 702, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:18:36,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1606842.0, ans=0.125 2023-06-26 16:20:12,451 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.00 vs. limit=15.0 2023-06-26 16:20:16,520 INFO [train.py:996] (2/4) Epoch 9, batch 23900, loss[loss=0.267, simple_loss=0.3742, pruned_loss=0.07985, over 21616.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3076, pruned_loss=0.07163, over 4263832.59 frames. ], batch size: 414, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:20:45,482 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.978e+02 7.206e+02 9.928e+02 1.468e+03 4.059e+03, threshold=1.986e+03, percent-clipped=20.0 2023-06-26 16:21:02,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1607262.0, ans=0.0 2023-06-26 16:22:02,571 INFO [train.py:996] (2/4) Epoch 9, batch 23950, loss[loss=0.2037, simple_loss=0.2762, pruned_loss=0.06563, over 21734.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3016, pruned_loss=0.07099, over 4261505.58 frames. ], batch size: 282, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:22:41,800 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=15.0 2023-06-26 16:23:34,826 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.93 vs. limit=15.0 2023-06-26 16:23:50,748 INFO [train.py:996] (2/4) Epoch 9, batch 24000, loss[loss=0.2508, simple_loss=0.3378, pruned_loss=0.0819, over 21457.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3029, pruned_loss=0.07337, over 4263443.08 frames. ], batch size: 131, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:23:50,749 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-26 16:24:10,709 INFO [train.py:1028] (2/4) Epoch 9, validation: loss=0.2632, simple_loss=0.3589, pruned_loss=0.0837, over 1796401.00 frames. 2023-06-26 16:24:10,710 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-26 16:24:36,343 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.989e+02 5.715e+02 7.802e+02 1.213e+03 2.324e+03, threshold=1.560e+03, percent-clipped=4.0 2023-06-26 16:25:00,864 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.43 vs. limit=15.0 2023-06-26 16:25:40,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1607982.0, ans=0.125 2023-06-26 16:26:00,889 INFO [train.py:996] (2/4) Epoch 9, batch 24050, loss[loss=0.1926, simple_loss=0.2874, pruned_loss=0.04889, over 21824.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3042, pruned_loss=0.0739, over 4265225.69 frames. 
], batch size: 282, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:26:03,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1608042.0, ans=0.0 2023-06-26 16:26:55,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1608162.0, ans=0.05 2023-06-26 16:27:19,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1608222.0, ans=0.0 2023-06-26 16:27:32,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1608282.0, ans=0.015 2023-06-26 16:27:49,848 INFO [train.py:996] (2/4) Epoch 9, batch 24100, loss[loss=0.2704, simple_loss=0.3522, pruned_loss=0.09429, over 21574.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3044, pruned_loss=0.07229, over 4263187.33 frames. ], batch size: 414, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:27:59,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1608342.0, ans=0.125 2023-06-26 16:28:15,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1608402.0, ans=0.1 2023-06-26 16:28:27,568 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.963e+02 5.200e+02 7.145e+02 1.046e+03 2.381e+03, threshold=1.429e+03, percent-clipped=3.0 2023-06-26 16:28:33,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1608402.0, ans=0.125 2023-06-26 16:29:24,444 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.68 vs. limit=22.5 2023-06-26 16:29:39,230 INFO [train.py:996] (2/4) Epoch 9, batch 24150, loss[loss=0.1951, simple_loss=0.2574, pruned_loss=0.06642, over 21169.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3033, pruned_loss=0.0737, over 4266129.15 frames. ], batch size: 608, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:30:17,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1608702.0, ans=0.125 2023-06-26 16:31:29,817 INFO [train.py:996] (2/4) Epoch 9, batch 24200, loss[loss=0.2131, simple_loss=0.2895, pruned_loss=0.06838, over 21437.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3065, pruned_loss=0.07556, over 4272010.65 frames. ], batch size: 195, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:32:12,924 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.636e+02 5.699e+02 8.049e+02 1.259e+03 2.421e+03, threshold=1.610e+03, percent-clipped=17.0 2023-06-26 16:32:45,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1609122.0, ans=0.0 2023-06-26 16:32:47,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1609122.0, ans=0.2 2023-06-26 16:32:50,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1609122.0, ans=0.1 2023-06-26 16:33:31,013 INFO [train.py:996] (2/4) Epoch 9, batch 24250, loss[loss=0.1733, simple_loss=0.2709, pruned_loss=0.03788, over 21840.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.3037, pruned_loss=0.06959, over 4265351.58 frames. 
], batch size: 316, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:34:41,975 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.96 vs. limit=6.0 2023-06-26 16:35:18,865 INFO [train.py:996] (2/4) Epoch 9, batch 24300, loss[loss=0.1932, simple_loss=0.2762, pruned_loss=0.05512, over 21647.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2966, pruned_loss=0.0649, over 4269271.73 frames. ], batch size: 389, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:35:50,238 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.136e+02 4.084e+02 7.233e+02 1.324e+03 4.143e+03, threshold=1.447e+03, percent-clipped=16.0 2023-06-26 16:37:07,362 INFO [train.py:996] (2/4) Epoch 9, batch 24350, loss[loss=0.2047, simple_loss=0.2784, pruned_loss=0.06548, over 21475.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2937, pruned_loss=0.06591, over 4272194.40 frames. ], batch size: 194, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:37:08,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1609842.0, ans=0.125 2023-06-26 16:37:27,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1609842.0, ans=0.0 2023-06-26 16:37:59,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1609962.0, ans=0.1 2023-06-26 16:38:01,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1609962.0, ans=0.1 2023-06-26 16:38:44,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1610082.0, ans=0.0 2023-06-26 16:39:02,393 INFO [train.py:996] (2/4) Epoch 9, batch 24400, loss[loss=0.2098, simple_loss=0.2977, pruned_loss=0.06094, over 21612.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2975, pruned_loss=0.06895, over 4273795.51 frames. ], batch size: 230, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:39:34,021 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.576e+02 5.148e+02 6.716e+02 1.029e+03 2.743e+03, threshold=1.343e+03, percent-clipped=7.0 2023-06-26 16:39:34,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1610202.0, ans=0.125 2023-06-26 16:39:36,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1610202.0, ans=0.2 2023-06-26 16:40:05,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1610322.0, ans=0.0 2023-06-26 16:40:46,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1610382.0, ans=0.2 2023-06-26 16:40:49,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1610382.0, ans=0.125 2023-06-26 16:40:52,893 INFO [train.py:996] (2/4) Epoch 9, batch 24450, loss[loss=0.1724, simple_loss=0.2473, pruned_loss=0.0487, over 16326.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2998, pruned_loss=0.07018, over 4266699.93 frames. 
], batch size: 63, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:41:57,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1610622.0, ans=0.125 2023-06-26 16:42:26,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1610682.0, ans=0.04949747468305833 2023-06-26 16:42:41,539 INFO [train.py:996] (2/4) Epoch 9, batch 24500, loss[loss=0.1853, simple_loss=0.2667, pruned_loss=0.05194, over 21642.00 frames. ], tot_loss[loss=0.219, simple_loss=0.299, pruned_loss=0.06951, over 4272154.66 frames. ], batch size: 263, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:42:54,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1610742.0, ans=0.2 2023-06-26 16:43:14,699 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.443e+02 5.135e+02 6.610e+02 1.095e+03 2.710e+03, threshold=1.322e+03, percent-clipped=12.0 2023-06-26 16:43:49,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1610922.0, ans=0.0 2023-06-26 16:44:08,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1610922.0, ans=0.0 2023-06-26 16:44:35,193 INFO [train.py:996] (2/4) Epoch 9, batch 24550, loss[loss=0.27, simple_loss=0.3334, pruned_loss=0.1033, over 21449.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3002, pruned_loss=0.07082, over 4274191.82 frames. ], batch size: 471, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:45:15,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1611162.0, ans=0.125 2023-06-26 16:45:36,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1611222.0, ans=0.0 2023-06-26 16:45:39,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1611222.0, ans=0.015 2023-06-26 16:46:16,637 INFO [train.py:996] (2/4) Epoch 9, batch 24600, loss[loss=0.1806, simple_loss=0.2447, pruned_loss=0.05829, over 21783.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2967, pruned_loss=0.07173, over 4267614.98 frames. ], batch size: 112, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:46:48,951 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.731e+02 5.495e+02 6.731e+02 9.246e+02 1.741e+03, threshold=1.346e+03, percent-clipped=6.0 2023-06-26 16:47:19,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1611462.0, ans=0.1 2023-06-26 16:47:49,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1611582.0, ans=0.125 2023-06-26 16:48:05,295 INFO [train.py:996] (2/4) Epoch 9, batch 24650, loss[loss=0.1758, simple_loss=0.2387, pruned_loss=0.05642, over 21594.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2886, pruned_loss=0.07045, over 4269764.71 frames. ], batch size: 231, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:49:45,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1611882.0, ans=0.07 2023-06-26 16:49:58,461 INFO [train.py:996] (2/4) Epoch 9, batch 24700, loss[loss=0.2281, simple_loss=0.284, pruned_loss=0.08615, over 21369.00 frames. 
], tot_loss[loss=0.2132, simple_loss=0.2889, pruned_loss=0.06872, over 4272919.46 frames. ], batch size: 473, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:50:08,059 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2023-06-26 16:50:31,842 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.355e+02 4.942e+02 6.984e+02 9.406e+02 2.267e+03, threshold=1.397e+03, percent-clipped=8.0 2023-06-26 16:51:23,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1612182.0, ans=0.1 2023-06-26 16:51:33,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1612182.0, ans=0.125 2023-06-26 16:51:37,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1612182.0, ans=0.125 2023-06-26 16:51:46,671 INFO [train.py:996] (2/4) Epoch 9, batch 24750, loss[loss=0.1732, simple_loss=0.251, pruned_loss=0.04767, over 21671.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2825, pruned_loss=0.0666, over 4272456.51 frames. ], batch size: 298, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:52:33,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1612362.0, ans=0.125 2023-06-26 16:52:52,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1612362.0, ans=0.1 2023-06-26 16:53:06,980 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-26 16:53:29,868 INFO [train.py:996] (2/4) Epoch 9, batch 24800, loss[loss=0.2376, simple_loss=0.2999, pruned_loss=0.08768, over 21565.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2788, pruned_loss=0.06621, over 4277446.33 frames. ], batch size: 471, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:54:10,077 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.499e+02 5.335e+02 8.218e+02 1.489e+03 3.682e+03, threshold=1.644e+03, percent-clipped=29.0 2023-06-26 16:54:22,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1612662.0, ans=0.125 2023-06-26 16:54:29,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1612662.0, ans=0.125 2023-06-26 16:54:44,453 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-26 16:54:53,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1612722.0, ans=0.0 2023-06-26 16:55:20,281 INFO [train.py:996] (2/4) Epoch 9, batch 24850, loss[loss=0.1722, simple_loss=0.2386, pruned_loss=0.05285, over 21361.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2799, pruned_loss=0.06794, over 4282367.51 frames. 
], batch size: 176, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:55:29,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1612842.0, ans=0.1 2023-06-26 16:55:31,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1612842.0, ans=0.0 2023-06-26 16:55:34,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1612842.0, ans=0.0 2023-06-26 16:55:57,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1612902.0, ans=0.0 2023-06-26 16:56:53,424 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=22.5 2023-06-26 16:57:14,600 INFO [train.py:996] (2/4) Epoch 9, batch 24900, loss[loss=0.2897, simple_loss=0.3555, pruned_loss=0.112, over 21410.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2836, pruned_loss=0.0692, over 4279748.32 frames. ], batch size: 471, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:57:52,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1613202.0, ans=0.125 2023-06-26 16:57:54,891 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.038e+02 5.408e+02 8.463e+02 1.347e+03 2.375e+03, threshold=1.693e+03, percent-clipped=14.0 2023-06-26 16:57:59,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1613262.0, ans=0.125 2023-06-26 16:58:14,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1613262.0, ans=0.0 2023-06-26 16:59:11,115 INFO [train.py:996] (2/4) Epoch 9, batch 24950, loss[loss=0.2253, simple_loss=0.3006, pruned_loss=0.07501, over 21816.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2905, pruned_loss=0.07257, over 4280190.83 frames. ], batch size: 247, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:59:52,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1613502.0, ans=0.125 2023-06-26 17:00:13,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1613622.0, ans=0.1 2023-06-26 17:00:59,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1613742.0, ans=0.0 2023-06-26 17:01:05,687 INFO [train.py:996] (2/4) Epoch 9, batch 25000, loss[loss=0.2071, simple_loss=0.2801, pruned_loss=0.06707, over 22000.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2968, pruned_loss=0.07412, over 4283775.50 frames. 
], batch size: 103, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 17:01:18,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1613742.0, ans=0.125 2023-06-26 17:01:35,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1613802.0, ans=0.04949747468305833 2023-06-26 17:01:40,151 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.080e+02 5.287e+02 8.385e+02 1.349e+03 3.356e+03, threshold=1.677e+03, percent-clipped=10.0 2023-06-26 17:01:49,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1613862.0, ans=0.1 2023-06-26 17:02:00,588 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.36 vs. limit=6.0 2023-06-26 17:02:12,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1613922.0, ans=0.125 2023-06-26 17:02:38,706 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=22.5 2023-06-26 17:02:52,731 INFO [train.py:996] (2/4) Epoch 9, batch 25050, loss[loss=0.1737, simple_loss=0.2429, pruned_loss=0.05219, over 21580.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2898, pruned_loss=0.07261, over 4277699.98 frames. ], batch size: 213, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:02:55,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1614042.0, ans=10.0 2023-06-26 17:03:10,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1614102.0, ans=0.0 2023-06-26 17:03:49,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1614162.0, ans=0.2 2023-06-26 17:04:08,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1614222.0, ans=0.05 2023-06-26 17:04:40,851 INFO [train.py:996] (2/4) Epoch 9, batch 25100, loss[loss=0.2547, simple_loss=0.3264, pruned_loss=0.09153, over 21448.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2848, pruned_loss=0.07145, over 4281166.70 frames. ], batch size: 508, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:04:43,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1614342.0, ans=0.1 2023-06-26 17:05:01,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1614402.0, ans=0.05 2023-06-26 17:05:15,428 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.352e+02 5.789e+02 8.437e+02 1.364e+03 2.592e+03, threshold=1.687e+03, percent-clipped=13.0 2023-06-26 17:06:16,731 INFO [train.py:996] (2/4) Epoch 9, batch 25150, loss[loss=0.1933, simple_loss=0.2846, pruned_loss=0.05101, over 21782.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2874, pruned_loss=0.06938, over 4264494.73 frames. 
], batch size: 298, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:07:13,786 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 17:07:29,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1614822.0, ans=0.125 2023-06-26 17:08:05,180 INFO [train.py:996] (2/4) Epoch 9, batch 25200, loss[loss=0.221, simple_loss=0.3104, pruned_loss=0.06581, over 21745.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2867, pruned_loss=0.06716, over 4254901.62 frames. ], batch size: 351, lr: 3.22e-03, grad_scale: 32.0 2023-06-26 17:08:41,003 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.90 vs. limit=22.5 2023-06-26 17:08:44,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1615002.0, ans=0.125 2023-06-26 17:08:50,509 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.256e+02 4.732e+02 7.162e+02 1.048e+03 3.410e+03, threshold=1.432e+03, percent-clipped=11.0 2023-06-26 17:09:01,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1615062.0, ans=0.125 2023-06-26 17:09:33,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1615182.0, ans=0.125 2023-06-26 17:09:52,310 INFO [train.py:996] (2/4) Epoch 9, batch 25250, loss[loss=0.2203, simple_loss=0.2893, pruned_loss=0.07564, over 22010.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2862, pruned_loss=0.06627, over 4255143.69 frames. ], batch size: 103, lr: 3.22e-03, grad_scale: 32.0 2023-06-26 17:09:53,485 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.33 vs. limit=12.0 2023-06-26 17:10:33,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1615302.0, ans=0.0 2023-06-26 17:11:09,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1615422.0, ans=0.0 2023-06-26 17:11:38,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1615542.0, ans=0.95 2023-06-26 17:11:39,536 INFO [train.py:996] (2/4) Epoch 9, batch 25300, loss[loss=0.2188, simple_loss=0.2972, pruned_loss=0.07015, over 21724.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2832, pruned_loss=0.06564, over 4257438.17 frames. 
], batch size: 351, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:12:00,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1615542.0, ans=0.0 2023-06-26 17:12:14,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1615602.0, ans=0.1 2023-06-26 17:12:22,278 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.284e+02 5.811e+02 7.982e+02 1.248e+03 2.930e+03, threshold=1.596e+03, percent-clipped=17.0 2023-06-26 17:12:48,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1615722.0, ans=0.125 2023-06-26 17:13:09,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1615722.0, ans=0.1 2023-06-26 17:13:16,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1615782.0, ans=0.125 2023-06-26 17:13:29,742 INFO [train.py:996] (2/4) Epoch 9, batch 25350, loss[loss=0.2346, simple_loss=0.3254, pruned_loss=0.07189, over 21301.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2866, pruned_loss=0.06579, over 4261028.72 frames. ], batch size: 549, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:13:56,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1615902.0, ans=0.035 2023-06-26 17:14:05,479 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 17:14:33,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1615962.0, ans=0.1 2023-06-26 17:15:01,406 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.98 vs. limit=6.0 2023-06-26 17:15:17,099 INFO [train.py:996] (2/4) Epoch 9, batch 25400, loss[loss=0.2015, simple_loss=0.27, pruned_loss=0.06647, over 21861.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2828, pruned_loss=0.06513, over 4255110.61 frames. 
], batch size: 107, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:15:42,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1616142.0, ans=0.0 2023-06-26 17:15:58,560 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.261e+02 5.059e+02 8.454e+02 1.158e+03 2.444e+03, threshold=1.691e+03, percent-clipped=8.0 2023-06-26 17:16:21,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1616262.0, ans=0.0 2023-06-26 17:16:24,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1616322.0, ans=0.0 2023-06-26 17:16:41,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1616322.0, ans=0.125 2023-06-26 17:16:57,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1616382.0, ans=0.1 2023-06-26 17:16:59,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1616382.0, ans=0.95 2023-06-26 17:17:04,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1616442.0, ans=0.2 2023-06-26 17:17:05,787 INFO [train.py:996] (2/4) Epoch 9, batch 25450, loss[loss=0.2174, simple_loss=0.3065, pruned_loss=0.06417, over 21698.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2828, pruned_loss=0.06621, over 4258808.63 frames. ], batch size: 263, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:17:13,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1616442.0, ans=0.0 2023-06-26 17:17:28,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1616442.0, ans=0.04949747468305833 2023-06-26 17:17:43,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1616502.0, ans=0.125 2023-06-26 17:17:53,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1616502.0, ans=0.125 2023-06-26 17:17:58,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1616562.0, ans=0.0 2023-06-26 17:18:00,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1616562.0, ans=0.125 2023-06-26 17:18:05,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1616562.0, ans=0.2 2023-06-26 17:18:35,429 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.39 vs. limit=6.0 2023-06-26 17:18:55,043 INFO [train.py:996] (2/4) Epoch 9, batch 25500, loss[loss=0.2046, simple_loss=0.2823, pruned_loss=0.06345, over 21315.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2826, pruned_loss=0.06296, over 4266615.03 frames. ], batch size: 159, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:19:30,321 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.68 vs. 
limit=15.0 2023-06-26 17:19:43,311 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.118e+02 5.222e+02 7.710e+02 1.108e+03 2.263e+03, threshold=1.542e+03, percent-clipped=6.0 2023-06-26 17:19:45,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1616862.0, ans=0.1 2023-06-26 17:19:55,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1616862.0, ans=0.0 2023-06-26 17:20:32,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1616982.0, ans=0.125 2023-06-26 17:20:48,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1616982.0, ans=0.125 2023-06-26 17:20:56,310 INFO [train.py:996] (2/4) Epoch 9, batch 25550, loss[loss=0.2072, simple_loss=0.3091, pruned_loss=0.05272, over 21764.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2893, pruned_loss=0.06312, over 4271212.30 frames. ], batch size: 332, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:21:10,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1617042.0, ans=0.125 2023-06-26 17:21:33,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1617102.0, ans=0.0 2023-06-26 17:21:41,212 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=15.0 2023-06-26 17:21:57,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1617162.0, ans=0.09899494936611666 2023-06-26 17:22:46,557 INFO [train.py:996] (2/4) Epoch 9, batch 25600, loss[loss=0.2294, simple_loss=0.3089, pruned_loss=0.07491, over 21648.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2939, pruned_loss=0.06429, over 4272511.12 frames. ], batch size: 263, lr: 3.22e-03, grad_scale: 32.0 2023-06-26 17:23:10,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1617402.0, ans=0.125 2023-06-26 17:23:26,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1617402.0, ans=0.125 2023-06-26 17:23:28,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1617402.0, ans=0.125 2023-06-26 17:23:29,862 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.761e+02 5.184e+02 7.757e+02 1.041e+03 2.426e+03, threshold=1.551e+03, percent-clipped=8.0 2023-06-26 17:23:52,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1617522.0, ans=0.125 2023-06-26 17:24:15,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1617522.0, ans=0.125 2023-06-26 17:24:36,594 INFO [train.py:996] (2/4) Epoch 9, batch 25650, loss[loss=0.2076, simple_loss=0.274, pruned_loss=0.0706, over 15607.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2928, pruned_loss=0.06605, over 4261317.22 frames. 
], batch size: 60, lr: 3.22e-03, grad_scale: 32.0 2023-06-26 17:25:46,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1617822.0, ans=0.1 2023-06-26 17:26:07,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1617882.0, ans=0.125 2023-06-26 17:26:12,750 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 17:26:24,648 INFO [train.py:996] (2/4) Epoch 9, batch 25700, loss[loss=0.1968, simple_loss=0.2722, pruned_loss=0.06065, over 21589.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2906, pruned_loss=0.06727, over 4259989.91 frames. ], batch size: 263, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:26:27,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1617942.0, ans=0.125 2023-06-26 17:27:02,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1618002.0, ans=0.2 2023-06-26 17:27:08,524 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.95 vs. limit=15.0 2023-06-26 17:27:08,749 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.793e+02 5.331e+02 7.573e+02 1.078e+03 3.200e+03, threshold=1.515e+03, percent-clipped=12.0 2023-06-26 17:27:26,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1618062.0, ans=0.0 2023-06-26 17:27:43,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1618122.0, ans=0.2 2023-06-26 17:28:21,570 INFO [train.py:996] (2/4) Epoch 9, batch 25750, loss[loss=0.2583, simple_loss=0.3462, pruned_loss=0.0852, over 21777.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2956, pruned_loss=0.07038, over 4269086.15 frames. ], batch size: 247, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:28:45,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1618302.0, ans=0.125 2023-06-26 17:28:58,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1618302.0, ans=0.125 2023-06-26 17:29:01,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1618362.0, ans=0.015 2023-06-26 17:29:12,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1618362.0, ans=0.125 2023-06-26 17:29:41,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1618422.0, ans=0.5 2023-06-26 17:29:54,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1618482.0, ans=10.0 2023-06-26 17:29:56,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1618482.0, ans=0.2 2023-06-26 17:30:18,706 INFO [train.py:996] (2/4) Epoch 9, batch 25800, loss[loss=0.2652, simple_loss=0.3575, pruned_loss=0.08643, over 21389.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3077, pruned_loss=0.07406, over 4273862.19 frames. 
], batch size: 131, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:31:03,952 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.718e+02 5.908e+02 7.803e+02 1.133e+03 2.789e+03, threshold=1.561e+03, percent-clipped=14.0 2023-06-26 17:31:09,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1618662.0, ans=0.125 2023-06-26 17:31:13,867 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=22.5 2023-06-26 17:31:34,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1618722.0, ans=0.125 2023-06-26 17:32:00,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1618782.0, ans=0.1 2023-06-26 17:32:08,626 INFO [train.py:996] (2/4) Epoch 9, batch 25850, loss[loss=0.2405, simple_loss=0.3214, pruned_loss=0.07978, over 21757.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3085, pruned_loss=0.07315, over 4280050.11 frames. ], batch size: 112, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:32:18,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1618842.0, ans=0.1 2023-06-26 17:32:49,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1618902.0, ans=0.0 2023-06-26 17:33:24,839 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 17:33:27,637 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=12.0 2023-06-26 17:34:03,379 INFO [train.py:996] (2/4) Epoch 9, batch 25900, loss[loss=0.2796, simple_loss=0.3758, pruned_loss=0.0917, over 21717.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3095, pruned_loss=0.07411, over 4279654.89 frames. ], batch size: 389, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:34:12,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1619142.0, ans=0.125 2023-06-26 17:34:25,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1619202.0, ans=0.0 2023-06-26 17:34:27,633 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.72 vs. limit=22.5 2023-06-26 17:34:47,604 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.581e+02 5.400e+02 8.685e+02 1.109e+03 2.488e+03, threshold=1.737e+03, percent-clipped=11.0 2023-06-26 17:35:26,736 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=15.0 2023-06-26 17:35:50,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1619382.0, ans=0.2 2023-06-26 17:35:58,966 INFO [train.py:996] (2/4) Epoch 9, batch 25950, loss[loss=0.2358, simple_loss=0.3096, pruned_loss=0.08096, over 21331.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3154, pruned_loss=0.0764, over 4275338.65 frames. 
], batch size: 549, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:36:03,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1619442.0, ans=0.125 2023-06-26 17:36:34,250 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.13 vs. limit=15.0 2023-06-26 17:37:08,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1619622.0, ans=0.0 2023-06-26 17:37:49,206 INFO [train.py:996] (2/4) Epoch 9, batch 26000, loss[loss=0.2028, simple_loss=0.2966, pruned_loss=0.05452, over 21386.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3142, pruned_loss=0.07465, over 4274196.34 frames. ], batch size: 211, lr: 3.22e-03, grad_scale: 32.0 2023-06-26 17:38:33,555 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.565e+02 5.045e+02 5.850e+02 7.861e+02 1.944e+03, threshold=1.170e+03, percent-clipped=2.0 2023-06-26 17:38:44,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1619862.0, ans=0.0 2023-06-26 17:39:26,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1619982.0, ans=0.125 2023-06-26 17:39:37,973 INFO [train.py:996] (2/4) Epoch 9, batch 26050, loss[loss=0.1895, simple_loss=0.2503, pruned_loss=0.06439, over 21129.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3131, pruned_loss=0.0759, over 4281425.71 frames. ], batch size: 608, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:39:43,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1620042.0, ans=0.0 2023-06-26 17:40:00,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1620102.0, ans=0.2 2023-06-26 17:40:32,942 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.42 vs. limit=22.5 2023-06-26 17:40:41,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1620222.0, ans=0.07 2023-06-26 17:41:18,801 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-26 17:41:21,074 INFO [train.py:996] (2/4) Epoch 9, batch 26100, loss[loss=0.2446, simple_loss=0.2985, pruned_loss=0.09534, over 21678.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3078, pruned_loss=0.07566, over 4283919.47 frames. 
], batch size: 473, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:41:28,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1620342.0, ans=0.125 2023-06-26 17:42:06,466 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.994e+02 5.585e+02 7.440e+02 1.140e+03 2.110e+03, threshold=1.488e+03, percent-clipped=23.0 2023-06-26 17:42:40,863 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 17:42:44,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1620522.0, ans=0.125 2023-06-26 17:42:49,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1620582.0, ans=0.125 2023-06-26 17:42:50,288 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=15.0 2023-06-26 17:43:04,901 INFO [train.py:996] (2/4) Epoch 9, batch 26150, loss[loss=0.2259, simple_loss=0.3002, pruned_loss=0.07578, over 21704.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.305, pruned_loss=0.07583, over 4287802.76 frames. ], batch size: 351, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:43:44,115 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.33 vs. limit=22.5 2023-06-26 17:43:45,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1620702.0, ans=0.5 2023-06-26 17:43:45,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1620702.0, ans=0.04949747468305833 2023-06-26 17:43:46,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1620702.0, ans=0.0 2023-06-26 17:43:48,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1620702.0, ans=0.1 2023-06-26 17:44:16,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1620822.0, ans=0.1 2023-06-26 17:45:00,356 INFO [train.py:996] (2/4) Epoch 9, batch 26200, loss[loss=0.2346, simple_loss=0.3414, pruned_loss=0.06393, over 21855.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3063, pruned_loss=0.07434, over 4280217.07 frames. ], batch size: 371, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:45:26,929 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.60 vs. limit=12.0 2023-06-26 17:45:41,652 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.840e+02 5.161e+02 8.097e+02 1.241e+03 2.329e+03, threshold=1.619e+03, percent-clipped=17.0 2023-06-26 17:46:32,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1621182.0, ans=0.025 2023-06-26 17:46:56,620 INFO [train.py:996] (2/4) Epoch 9, batch 26250, loss[loss=0.2297, simple_loss=0.3044, pruned_loss=0.07749, over 21784.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3096, pruned_loss=0.073, over 4286697.82 frames. 
], batch size: 441, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:47:02,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1621242.0, ans=0.125 2023-06-26 17:48:44,888 INFO [train.py:996] (2/4) Epoch 9, batch 26300, loss[loss=0.2091, simple_loss=0.2822, pruned_loss=0.06803, over 21692.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3066, pruned_loss=0.07341, over 4288951.57 frames. ], batch size: 230, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:49:10,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1621602.0, ans=0.1 2023-06-26 17:49:13,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1621602.0, ans=0.0 2023-06-26 17:49:13,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1621602.0, ans=0.2 2023-06-26 17:49:25,503 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.773e+02 5.088e+02 7.206e+02 1.171e+03 1.823e+03, threshold=1.441e+03, percent-clipped=7.0 2023-06-26 17:49:53,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1621722.0, ans=0.0 2023-06-26 17:50:10,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1621782.0, ans=0.0 2023-06-26 17:50:34,492 INFO [train.py:996] (2/4) Epoch 9, batch 26350, loss[loss=0.2414, simple_loss=0.3211, pruned_loss=0.0809, over 21862.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3057, pruned_loss=0.07403, over 4291929.68 frames. ], batch size: 371, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:52:19,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1622082.0, ans=0.125 2023-06-26 17:52:23,831 INFO [train.py:996] (2/4) Epoch 9, batch 26400, loss[loss=0.1911, simple_loss=0.2614, pruned_loss=0.06038, over 21756.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3006, pruned_loss=0.07422, over 4270121.76 frames. ], batch size: 124, lr: 3.22e-03, grad_scale: 32.0 2023-06-26 17:52:35,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1622142.0, ans=0.0 2023-06-26 17:52:56,378 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=15.0 2023-06-26 17:53:12,758 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.822e+02 5.037e+02 6.959e+02 9.647e+02 1.675e+03, threshold=1.392e+03, percent-clipped=4.0 2023-06-26 17:53:42,983 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-26 17:53:59,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1622382.0, ans=0.125 2023-06-26 17:54:16,745 INFO [train.py:996] (2/4) Epoch 9, batch 26450, loss[loss=0.2192, simple_loss=0.3161, pruned_loss=0.06112, over 21718.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3015, pruned_loss=0.07447, over 4260702.11 frames. 
], batch size: 298, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:55:29,052 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=12.0 2023-06-26 17:55:39,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1622622.0, ans=0.125 2023-06-26 17:56:13,581 INFO [train.py:996] (2/4) Epoch 9, batch 26500, loss[loss=0.1794, simple_loss=0.2517, pruned_loss=0.05353, over 21422.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3035, pruned_loss=0.07318, over 4261606.91 frames. ], batch size: 211, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:56:28,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1622742.0, ans=0.125 2023-06-26 17:57:00,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1622802.0, ans=0.0 2023-06-26 17:57:07,198 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.802e+02 5.662e+02 1.052e+03 1.637e+03 4.186e+03, threshold=2.103e+03, percent-clipped=36.0 2023-06-26 17:57:43,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1622922.0, ans=0.125 2023-06-26 17:58:11,220 INFO [train.py:996] (2/4) Epoch 9, batch 26550, loss[loss=0.1739, simple_loss=0.2524, pruned_loss=0.04768, over 21413.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3008, pruned_loss=0.07094, over 4255310.48 frames. ], batch size: 194, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:58:11,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1623042.0, ans=0.0 2023-06-26 17:59:10,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1623162.0, ans=0.2 2023-06-26 17:59:24,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1623222.0, ans=0.125 2023-06-26 17:59:33,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1623222.0, ans=0.1 2023-06-26 17:59:48,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1623282.0, ans=0.2 2023-06-26 18:00:05,320 INFO [train.py:996] (2/4) Epoch 9, batch 26600, loss[loss=0.2027, simple_loss=0.2869, pruned_loss=0.05927, over 21744.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2999, pruned_loss=0.06891, over 4251935.39 frames. ], batch size: 316, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 18:00:47,582 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.393e+02 5.073e+02 7.169e+02 1.139e+03 3.123e+03, threshold=1.434e+03, percent-clipped=9.0 2023-06-26 18:00:50,726 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=22.5 2023-06-26 18:01:33,040 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.63 vs. 
limit=15.0 2023-06-26 18:01:48,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1623582.0, ans=0.125 2023-06-26 18:01:49,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1623582.0, ans=0.125 2023-06-26 18:01:59,712 INFO [train.py:996] (2/4) Epoch 9, batch 26650, loss[loss=0.1577, simple_loss=0.2287, pruned_loss=0.04339, over 21795.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2934, pruned_loss=0.06798, over 4250056.93 frames. ], batch size: 118, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 18:02:53,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1623762.0, ans=0.125 2023-06-26 18:03:11,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1623822.0, ans=0.125 2023-06-26 18:03:39,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1623942.0, ans=0.1 2023-06-26 18:03:40,919 INFO [train.py:996] (2/4) Epoch 9, batch 26700, loss[loss=0.2192, simple_loss=0.2861, pruned_loss=0.07618, over 21319.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2861, pruned_loss=0.06515, over 4254479.10 frames. ], batch size: 176, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 18:04:03,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1624002.0, ans=0.0 2023-06-26 18:04:29,910 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.831e+02 4.080e+02 5.616e+02 9.381e+02 2.662e+03, threshold=1.123e+03, percent-clipped=11.0 2023-06-26 18:05:11,215 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=36.11 vs. limit=15.0 2023-06-26 18:05:33,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1624182.0, ans=0.0 2023-06-26 18:05:36,277 INFO [train.py:996] (2/4) Epoch 9, batch 26750, loss[loss=0.2637, simple_loss=0.3392, pruned_loss=0.09414, over 21328.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2863, pruned_loss=0.06417, over 4266951.88 frames. ], batch size: 143, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:05:37,525 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-26 18:05:46,611 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=12.0 2023-06-26 18:05:52,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1624302.0, ans=0.05 2023-06-26 18:05:52,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1624302.0, ans=0.125 2023-06-26 18:06:08,343 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.57 vs. 
limit=15.0 2023-06-26 18:06:14,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1624302.0, ans=0.0 2023-06-26 18:07:08,775 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-26 18:07:27,091 INFO [train.py:996] (2/4) Epoch 9, batch 26800, loss[loss=0.2138, simple_loss=0.3186, pruned_loss=0.0545, over 20000.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.293, pruned_loss=0.0672, over 4266255.12 frames. ], batch size: 703, lr: 3.21e-03, grad_scale: 32.0 2023-06-26 18:08:15,099 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.607e+02 5.810e+02 7.473e+02 1.088e+03 2.811e+03, threshold=1.495e+03, percent-clipped=19.0 2023-06-26 18:08:47,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1624722.0, ans=0.125 2023-06-26 18:09:01,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1624782.0, ans=0.125 2023-06-26 18:09:22,006 INFO [train.py:996] (2/4) Epoch 9, batch 26850, loss[loss=0.2018, simple_loss=0.2652, pruned_loss=0.06926, over 22034.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2945, pruned_loss=0.07025, over 4269960.12 frames. ], batch size: 103, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:09:26,578 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=12.0 2023-06-26 18:09:58,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1624902.0, ans=0.125 2023-06-26 18:10:52,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1625082.0, ans=0.125 2023-06-26 18:11:03,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1625082.0, ans=0.125 2023-06-26 18:11:09,558 INFO [train.py:996] (2/4) Epoch 9, batch 26900, loss[loss=0.206, simple_loss=0.2687, pruned_loss=0.07167, over 21519.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2859, pruned_loss=0.06961, over 4272169.26 frames. ], batch size: 391, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:11:11,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1625142.0, ans=0.125 2023-06-26 18:11:11,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1625142.0, ans=0.125 2023-06-26 18:11:41,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1625202.0, ans=0.2 2023-06-26 18:11:44,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1625202.0, ans=0.0 2023-06-26 18:11:51,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1625262.0, ans=0.2 2023-06-26 18:11:51,992 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.46 vs. 
limit=15.0 2023-06-26 18:11:52,555 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.498e+02 4.462e+02 5.999e+02 9.238e+02 1.607e+03, threshold=1.200e+03, percent-clipped=3.0 2023-06-26 18:12:01,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1625262.0, ans=0.04949747468305833 2023-06-26 18:12:38,470 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.92 vs. limit=15.0 2023-06-26 18:12:41,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1625382.0, ans=0.125 2023-06-26 18:12:44,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1625382.0, ans=0.0 2023-06-26 18:12:57,990 INFO [train.py:996] (2/4) Epoch 9, batch 26950, loss[loss=0.214, simple_loss=0.3007, pruned_loss=0.06362, over 21646.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2844, pruned_loss=0.06918, over 4277966.69 frames. ], batch size: 247, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:13:40,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1625562.0, ans=0.125 2023-06-26 18:13:42,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1625562.0, ans=0.125 2023-06-26 18:14:12,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1625622.0, ans=0.125 2023-06-26 18:14:47,867 INFO [train.py:996] (2/4) Epoch 9, batch 27000, loss[loss=0.2122, simple_loss=0.3087, pruned_loss=0.05781, over 21626.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2853, pruned_loss=0.06709, over 4271983.96 frames. ], batch size: 389, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:14:47,867 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-26 18:15:07,486 INFO [train.py:1028] (2/4) Epoch 9, validation: loss=0.2501, simple_loss=0.3419, pruned_loss=0.07919, over 1796401.00 frames. 2023-06-26 18:15:07,487 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-26 18:15:59,910 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.455e+02 5.551e+02 8.937e+02 1.384e+03 3.879e+03, threshold=1.787e+03, percent-clipped=32.0 2023-06-26 18:16:51,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1625982.0, ans=0.125 2023-06-26 18:16:55,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1625982.0, ans=0.2 2023-06-26 18:16:57,928 INFO [train.py:996] (2/4) Epoch 9, batch 27050, loss[loss=0.1838, simple_loss=0.2807, pruned_loss=0.04346, over 21684.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2867, pruned_loss=0.06349, over 4271498.40 frames. ], batch size: 298, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:16:59,404 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.55 vs. limit=10.0 2023-06-26 18:17:42,367 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.05 vs. 
limit=15.0 2023-06-26 18:18:01,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1626222.0, ans=0.125 2023-06-26 18:18:01,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1626222.0, ans=0.0 2023-06-26 18:18:14,608 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.95 vs. limit=10.0 2023-06-26 18:18:49,374 INFO [train.py:996] (2/4) Epoch 9, batch 27100, loss[loss=0.2235, simple_loss=0.3013, pruned_loss=0.07289, over 21891.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2886, pruned_loss=0.06455, over 4274774.50 frames. ], batch size: 371, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:19:30,370 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.99 vs. limit=10.0 2023-06-26 18:19:42,256 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.471e+02 6.179e+02 8.599e+02 1.265e+03 2.717e+03, threshold=1.720e+03, percent-clipped=9.0 2023-06-26 18:19:46,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1626462.0, ans=0.95 2023-06-26 18:20:27,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1626582.0, ans=0.125 2023-06-26 18:20:46,692 INFO [train.py:996] (2/4) Epoch 9, batch 27150, loss[loss=0.2617, simple_loss=0.3565, pruned_loss=0.08342, over 21749.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2998, pruned_loss=0.06783, over 4280669.28 frames. ], batch size: 332, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:21:30,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1626762.0, ans=0.125 2023-06-26 18:22:34,992 INFO [train.py:996] (2/4) Epoch 9, batch 27200, loss[loss=0.2485, simple_loss=0.3317, pruned_loss=0.08262, over 21271.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3076, pruned_loss=0.07039, over 4277142.65 frames. ], batch size: 548, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:22:39,929 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=22.5 2023-06-26 18:23:25,803 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.290e+02 5.594e+02 8.054e+02 1.283e+03 2.318e+03, threshold=1.611e+03, percent-clipped=7.0 2023-06-26 18:24:00,060 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 18:24:00,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1627182.0, ans=0.125 2023-06-26 18:24:12,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1627182.0, ans=0.2 2023-06-26 18:24:30,207 INFO [train.py:996] (2/4) Epoch 9, batch 27250, loss[loss=0.2297, simple_loss=0.3042, pruned_loss=0.07763, over 20643.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3113, pruned_loss=0.07444, over 4280450.20 frames. 
], batch size: 607, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:24:31,397 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.84 vs. limit=22.5 2023-06-26 18:25:08,700 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=15.0 2023-06-26 18:25:22,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1627362.0, ans=0.1 2023-06-26 18:26:20,981 INFO [train.py:996] (2/4) Epoch 9, batch 27300, loss[loss=0.2195, simple_loss=0.3108, pruned_loss=0.06412, over 21652.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3127, pruned_loss=0.07496, over 4279538.31 frames. ], batch size: 230, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:26:30,821 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 18:26:40,259 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 18:26:54,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1627602.0, ans=0.0 2023-06-26 18:27:18,627 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.468e+02 5.640e+02 6.768e+02 9.000e+02 1.859e+03, threshold=1.354e+03, percent-clipped=2.0 2023-06-26 18:28:17,659 INFO [train.py:996] (2/4) Epoch 9, batch 27350, loss[loss=0.2218, simple_loss=0.3037, pruned_loss=0.06997, over 21242.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3147, pruned_loss=0.07557, over 4281142.13 frames. ], batch size: 143, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:28:21,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1627842.0, ans=0.1 2023-06-26 18:28:40,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1627902.0, ans=0.1 2023-06-26 18:28:43,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1627902.0, ans=0.025 2023-06-26 18:28:52,320 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.36 vs. limit=15.0 2023-06-26 18:29:22,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1627962.0, ans=0.0 2023-06-26 18:30:04,070 INFO [train.py:996] (2/4) Epoch 9, batch 27400, loss[loss=0.2077, simple_loss=0.2659, pruned_loss=0.07478, over 21224.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3106, pruned_loss=0.07521, over 4286286.75 frames. 
], batch size: 143, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:30:54,133 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.725e+02 5.126e+02 6.894e+02 1.011e+03 2.169e+03, threshold=1.379e+03, percent-clipped=11.0 2023-06-26 18:31:12,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1628322.0, ans=0.0 2023-06-26 18:31:25,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1628322.0, ans=0.1 2023-06-26 18:31:43,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1628382.0, ans=0.0 2023-06-26 18:31:52,292 INFO [train.py:996] (2/4) Epoch 9, batch 27450, loss[loss=0.2785, simple_loss=0.3419, pruned_loss=0.1076, over 21315.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3042, pruned_loss=0.07379, over 4278649.73 frames. ], batch size: 507, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:31:53,463 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-26 18:32:32,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1628502.0, ans=0.2 2023-06-26 18:32:32,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1628502.0, ans=0.2 2023-06-26 18:33:10,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1628622.0, ans=0.035 2023-06-26 18:33:13,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1628622.0, ans=0.2 2023-06-26 18:33:22,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1628682.0, ans=0.07 2023-06-26 18:33:33,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1628682.0, ans=0.04949747468305833 2023-06-26 18:33:38,619 INFO [train.py:996] (2/4) Epoch 9, batch 27500, loss[loss=0.2287, simple_loss=0.2994, pruned_loss=0.07903, over 21370.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3034, pruned_loss=0.07429, over 4286184.64 frames. ], batch size: 144, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:33:46,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1628742.0, ans=0.1 2023-06-26 18:34:29,858 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.928e+02 5.202e+02 7.866e+02 1.174e+03 2.313e+03, threshold=1.573e+03, percent-clipped=14.0 2023-06-26 18:34:41,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1628862.0, ans=0.125 2023-06-26 18:35:27,114 INFO [train.py:996] (2/4) Epoch 9, batch 27550, loss[loss=0.2611, simple_loss=0.3604, pruned_loss=0.08091, over 20012.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2979, pruned_loss=0.07112, over 4285766.44 frames. 
], batch size: 702, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:35:27,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1629042.0, ans=0.0 2023-06-26 18:35:40,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1629042.0, ans=15.0 2023-06-26 18:36:11,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1629162.0, ans=0.125 2023-06-26 18:36:12,134 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.97 vs. limit=22.5 2023-06-26 18:36:40,996 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-06-26 18:37:01,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1629282.0, ans=0.125 2023-06-26 18:37:21,076 INFO [train.py:996] (2/4) Epoch 9, batch 27600, loss[loss=0.1996, simple_loss=0.2672, pruned_loss=0.06598, over 21772.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2905, pruned_loss=0.07005, over 4287159.90 frames. ], batch size: 371, lr: 3.21e-03, grad_scale: 32.0 2023-06-26 18:37:36,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1629342.0, ans=0.2 2023-06-26 18:37:54,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1629402.0, ans=0.125 2023-06-26 18:38:11,883 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.490e+02 6.372e+02 8.382e+02 1.316e+03 3.069e+03, threshold=1.676e+03, percent-clipped=15.0 2023-06-26 18:38:53,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1629582.0, ans=0.125 2023-06-26 18:39:08,004 INFO [train.py:996] (2/4) Epoch 9, batch 27650, loss[loss=0.2014, simple_loss=0.2738, pruned_loss=0.06448, over 21745.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2852, pruned_loss=0.06957, over 4275238.48 frames. ], batch size: 298, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:39:36,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1629702.0, ans=0.125 2023-06-26 18:39:46,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1629762.0, ans=0.07 2023-06-26 18:40:04,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1629762.0, ans=0.2 2023-06-26 18:40:09,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1629822.0, ans=0.1 2023-06-26 18:40:55,839 INFO [train.py:996] (2/4) Epoch 9, batch 27700, loss[loss=0.2271, simple_loss=0.3082, pruned_loss=0.07301, over 21756.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2869, pruned_loss=0.06838, over 4275315.07 frames. ], batch size: 414, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:41:34,951 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.18 vs. 
limit=12.0 2023-06-26 18:41:47,711 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.419e+02 4.763e+02 6.253e+02 8.900e+02 1.966e+03, threshold=1.251e+03, percent-clipped=3.0 2023-06-26 18:42:00,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1630122.0, ans=0.125 2023-06-26 18:42:01,387 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.91 vs. limit=10.0 2023-06-26 18:42:43,156 INFO [train.py:996] (2/4) Epoch 9, batch 27750, loss[loss=0.2023, simple_loss=0.313, pruned_loss=0.04582, over 20821.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2891, pruned_loss=0.0675, over 4276871.78 frames. ], batch size: 608, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:42:44,445 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.21 vs. limit=15.0 2023-06-26 18:42:48,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1630242.0, ans=0.125 2023-06-26 18:42:52,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1630242.0, ans=0.125 2023-06-26 18:43:30,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1630362.0, ans=0.125 2023-06-26 18:43:40,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1630362.0, ans=0.125 2023-06-26 18:44:22,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1630542.0, ans=0.0 2023-06-26 18:44:23,841 INFO [train.py:996] (2/4) Epoch 9, batch 27800, loss[loss=0.2159, simple_loss=0.2818, pruned_loss=0.07505, over 21624.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2878, pruned_loss=0.06776, over 4283954.74 frames. ], batch size: 548, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:44:25,281 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=12.0 2023-06-26 18:45:23,041 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.595e+02 5.099e+02 6.470e+02 1.005e+03 1.791e+03, threshold=1.294e+03, percent-clipped=14.0 2023-06-26 18:45:25,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1630662.0, ans=0.125 2023-06-26 18:46:07,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1630782.0, ans=0.1 2023-06-26 18:46:07,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1630782.0, ans=0.125 2023-06-26 18:46:18,805 INFO [train.py:996] (2/4) Epoch 9, batch 27850, loss[loss=0.2188, simple_loss=0.3012, pruned_loss=0.06823, over 21799.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2875, pruned_loss=0.06887, over 4292525.53 frames. 
], batch size: 247, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:47:01,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1630902.0, ans=0.125 2023-06-26 18:47:04,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1630962.0, ans=0.1 2023-06-26 18:47:20,041 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-06-26 18:47:57,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1631082.0, ans=0.0 2023-06-26 18:48:11,032 INFO [train.py:996] (2/4) Epoch 9, batch 27900, loss[loss=0.2698, simple_loss=0.3821, pruned_loss=0.07869, over 21180.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2976, pruned_loss=0.07012, over 4286077.63 frames. ], batch size: 548, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:48:39,591 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=12.0 2023-06-26 18:48:51,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1631202.0, ans=0.125 2023-06-26 18:48:53,781 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-26 18:49:12,750 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.641e+02 5.533e+02 7.337e+02 1.067e+03 2.093e+03, threshold=1.467e+03, percent-clipped=13.0 2023-06-26 18:49:20,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1631322.0, ans=0.125 2023-06-26 18:50:09,137 INFO [train.py:996] (2/4) Epoch 9, batch 27950, loss[loss=0.1943, simple_loss=0.2931, pruned_loss=0.0478, over 21721.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.298, pruned_loss=0.06745, over 4282136.07 frames. ], batch size: 247, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:50:19,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1631442.0, ans=0.125 2023-06-26 18:50:35,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1631502.0, ans=0.125 2023-06-26 18:50:42,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1631502.0, ans=0.125 2023-06-26 18:51:34,609 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.76 vs. limit=10.0 2023-06-26 18:51:49,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1631682.0, ans=0.0 2023-06-26 18:51:58,507 INFO [train.py:996] (2/4) Epoch 9, batch 28000, loss[loss=0.1945, simple_loss=0.2649, pruned_loss=0.06207, over 20137.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2946, pruned_loss=0.06492, over 4285968.49 frames. 
], batch size: 703, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:52:13,421 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 18:52:53,931 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.093e+02 5.535e+02 9.213e+02 1.280e+03 3.629e+03, threshold=1.843e+03, percent-clipped=20.0 2023-06-26 18:53:00,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1631862.0, ans=0.1 2023-06-26 18:53:21,001 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.57 vs. limit=15.0 2023-06-26 18:53:56,236 INFO [train.py:996] (2/4) Epoch 9, batch 28050, loss[loss=0.2768, simple_loss=0.3508, pruned_loss=0.1014, over 21522.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.293, pruned_loss=0.06627, over 4286620.83 frames. ], batch size: 471, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:53:57,427 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=22.5 2023-06-26 18:54:04,201 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.86 vs. limit=6.0 2023-06-26 18:54:15,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1632102.0, ans=0.125 2023-06-26 18:54:41,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1632162.0, ans=0.1 2023-06-26 18:54:59,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1632222.0, ans=0.1 2023-06-26 18:55:35,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1632282.0, ans=0.125 2023-06-26 18:55:44,942 INFO [train.py:996] (2/4) Epoch 9, batch 28100, loss[loss=0.1993, simple_loss=0.2687, pruned_loss=0.0649, over 21444.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2903, pruned_loss=0.06615, over 4278818.87 frames. ], batch size: 389, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:55:52,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1632342.0, ans=0.125 2023-06-26 18:56:20,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1632402.0, ans=0.125 2023-06-26 18:56:37,149 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.724e+02 5.263e+02 6.694e+02 1.046e+03 2.729e+03, threshold=1.339e+03, percent-clipped=5.0 2023-06-26 18:56:48,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1632522.0, ans=0.0 2023-06-26 18:57:29,627 INFO [train.py:996] (2/4) Epoch 9, batch 28150, loss[loss=0.2121, simple_loss=0.273, pruned_loss=0.07562, over 21764.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2848, pruned_loss=0.06581, over 4268385.97 frames. 
], batch size: 371, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:57:38,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1632642.0, ans=0.125 2023-06-26 18:58:02,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1632702.0, ans=0.125 2023-06-26 18:58:09,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1632762.0, ans=0.125 2023-06-26 18:58:34,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1632822.0, ans=0.1 2023-06-26 18:59:09,307 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.91 vs. limit=15.0 2023-06-26 18:59:18,536 INFO [train.py:996] (2/4) Epoch 9, batch 28200, loss[loss=0.2113, simple_loss=0.289, pruned_loss=0.06676, over 21962.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2836, pruned_loss=0.06774, over 4267885.15 frames. ], batch size: 317, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 19:00:13,420 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.542e+02 6.148e+02 9.394e+02 1.401e+03 3.381e+03, threshold=1.879e+03, percent-clipped=30.0 2023-06-26 19:00:23,058 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.48 vs. limit=22.5 2023-06-26 19:00:28,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1633122.0, ans=0.04949747468305833 2023-06-26 19:00:40,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1633122.0, ans=0.0 2023-06-26 19:01:07,261 INFO [train.py:996] (2/4) Epoch 9, batch 28250, loss[loss=0.203, simple_loss=0.2744, pruned_loss=0.06577, over 21664.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2864, pruned_loss=0.06989, over 4271583.29 frames. ], batch size: 332, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 19:01:11,233 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 19:01:19,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1633242.0, ans=0.0 2023-06-26 19:01:37,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1633302.0, ans=0.0 2023-06-26 19:02:10,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1633362.0, ans=0.0 2023-06-26 19:03:03,818 INFO [train.py:996] (2/4) Epoch 9, batch 28300, loss[loss=0.1745, simple_loss=0.2672, pruned_loss=0.04093, over 21764.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2844, pruned_loss=0.06799, over 4267398.65 frames. 
], batch size: 316, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 19:03:06,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1633542.0, ans=0.125 2023-06-26 19:03:11,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1633542.0, ans=0.125 2023-06-26 19:03:31,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1633602.0, ans=0.125 2023-06-26 19:03:58,619 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.505e+02 4.596e+02 7.876e+02 1.186e+03 2.671e+03, threshold=1.575e+03, percent-clipped=4.0 2023-06-26 19:04:27,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1633722.0, ans=0.125 2023-06-26 19:04:53,336 INFO [train.py:996] (2/4) Epoch 9, batch 28350, loss[loss=0.2109, simple_loss=0.3165, pruned_loss=0.05269, over 21567.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2801, pruned_loss=0.06299, over 4258856.50 frames. ], batch size: 441, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 19:05:09,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1633842.0, ans=0.125 2023-06-26 19:05:14,933 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.40 vs. limit=22.5 2023-06-26 19:06:27,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=1634082.0, ans=0.2 2023-06-26 19:06:46,260 INFO [train.py:996] (2/4) Epoch 9, batch 28400, loss[loss=0.2787, simple_loss=0.3368, pruned_loss=0.1103, over 21331.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2777, pruned_loss=0.06395, over 4254773.39 frames. ], batch size: 471, lr: 3.21e-03, grad_scale: 32.0 2023-06-26 19:07:16,845 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.43 vs. limit=15.0 2023-06-26 19:07:37,370 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.70 vs. limit=15.0 2023-06-26 19:07:41,790 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.757e+02 5.507e+02 7.639e+02 1.116e+03 2.582e+03, threshold=1.528e+03, percent-clipped=10.0 2023-06-26 19:07:58,854 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.40 vs. limit=12.0 2023-06-26 19:08:33,705 INFO [train.py:996] (2/4) Epoch 9, batch 28450, loss[loss=0.187, simple_loss=0.2477, pruned_loss=0.06318, over 20779.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2814, pruned_loss=0.06665, over 4262461.92 frames. 
], batch size: 608, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:08:55,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1634502.0, ans=0.0 2023-06-26 19:08:59,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1634502.0, ans=0.2 2023-06-26 19:09:46,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1634622.0, ans=0.0 2023-06-26 19:09:56,015 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.15 vs. limit=10.0 2023-06-26 19:10:22,642 INFO [train.py:996] (2/4) Epoch 9, batch 28500, loss[loss=0.2204, simple_loss=0.2942, pruned_loss=0.07328, over 21881.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2844, pruned_loss=0.06884, over 4271844.70 frames. ], batch size: 371, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:11:05,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1634802.0, ans=0.0 2023-06-26 19:11:12,575 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.83 vs. limit=10.0 2023-06-26 19:11:20,226 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.065e+02 5.079e+02 6.899e+02 9.776e+02 2.125e+03, threshold=1.380e+03, percent-clipped=6.0 2023-06-26 19:12:13,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1634982.0, ans=0.125 2023-06-26 19:12:16,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1635042.0, ans=0.015 2023-06-26 19:12:18,039 INFO [train.py:996] (2/4) Epoch 9, batch 28550, loss[loss=0.2145, simple_loss=0.2907, pruned_loss=0.06918, over 20807.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2935, pruned_loss=0.07141, over 4276284.23 frames. ], batch size: 608, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:12:18,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1635042.0, ans=0.1 2023-06-26 19:12:50,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1635102.0, ans=0.125 2023-06-26 19:12:56,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1635102.0, ans=0.0 2023-06-26 19:14:02,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1635282.0, ans=0.125 2023-06-26 19:14:06,453 INFO [train.py:996] (2/4) Epoch 9, batch 28600, loss[loss=0.2092, simple_loss=0.2874, pruned_loss=0.06545, over 21721.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3007, pruned_loss=0.07353, over 4274570.06 frames. 
], batch size: 298, lr: 3.20e-03, grad_scale: 8.0 2023-06-26 19:15:10,707 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.780e+02 5.299e+02 6.853e+02 1.013e+03 2.004e+03, threshold=1.371e+03, percent-clipped=8.0 2023-06-26 19:15:11,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1635462.0, ans=0.2 2023-06-26 19:15:18,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1635522.0, ans=0.125 2023-06-26 19:15:29,687 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.79 vs. limit=10.0 2023-06-26 19:15:39,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1635582.0, ans=0.0 2023-06-26 19:16:02,370 INFO [train.py:996] (2/4) Epoch 9, batch 28650, loss[loss=0.2015, simple_loss=0.2645, pruned_loss=0.06929, over 21279.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.296, pruned_loss=0.07294, over 4268717.11 frames. ], batch size: 177, lr: 3.20e-03, grad_scale: 8.0 2023-06-26 19:16:47,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1635762.0, ans=0.125 2023-06-26 19:17:50,894 INFO [train.py:996] (2/4) Epoch 9, batch 28700, loss[loss=0.2297, simple_loss=0.3066, pruned_loss=0.07645, over 21751.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2956, pruned_loss=0.07396, over 4265135.62 frames. ], batch size: 351, lr: 3.20e-03, grad_scale: 8.0 2023-06-26 19:18:22,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1636002.0, ans=0.125 2023-06-26 19:18:38,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1636062.0, ans=0.125 2023-06-26 19:18:48,052 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.582e+02 5.345e+02 7.889e+02 1.390e+03 2.918e+03, threshold=1.578e+03, percent-clipped=26.0 2023-06-26 19:19:40,174 INFO [train.py:996] (2/4) Epoch 9, batch 28750, loss[loss=0.243, simple_loss=0.3436, pruned_loss=0.07118, over 19868.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2958, pruned_loss=0.07453, over 4269742.56 frames. ], batch size: 703, lr: 3.20e-03, grad_scale: 8.0 2023-06-26 19:20:30,496 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2023-06-26 19:21:09,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=1636422.0, ans=0.2 2023-06-26 19:21:11,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1636422.0, ans=0.125 2023-06-26 19:21:16,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1636482.0, ans=0.0 2023-06-26 19:21:18,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1636482.0, ans=0.0 2023-06-26 19:21:31,183 INFO [train.py:996] (2/4) Epoch 9, batch 28800, loss[loss=0.2359, simple_loss=0.3072, pruned_loss=0.08225, over 21376.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2993, pruned_loss=0.07492, over 4276495.16 frames. 
], batch size: 548, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:21:46,828 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-26 19:22:33,370 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.768e+02 5.031e+02 6.250e+02 8.713e+02 2.260e+03, threshold=1.250e+03, percent-clipped=3.0 2023-06-26 19:23:00,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1636782.0, ans=0.125 2023-06-26 19:23:13,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1636782.0, ans=0.125 2023-06-26 19:23:15,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1636782.0, ans=6.0 2023-06-26 19:23:25,642 INFO [train.py:996] (2/4) Epoch 9, batch 28850, loss[loss=0.2247, simple_loss=0.2926, pruned_loss=0.0784, over 21278.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3006, pruned_loss=0.07588, over 4279334.69 frames. ], batch size: 159, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:23:27,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1636842.0, ans=0.1 2023-06-26 19:23:33,426 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=15.0 2023-06-26 19:23:52,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1636902.0, ans=0.125 2023-06-26 19:24:16,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1636962.0, ans=0.0 2023-06-26 19:24:34,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1637022.0, ans=0.0 2023-06-26 19:24:59,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1637082.0, ans=0.2 2023-06-26 19:25:14,979 INFO [train.py:996] (2/4) Epoch 9, batch 28900, loss[loss=0.2518, simple_loss=0.3233, pruned_loss=0.09019, over 21384.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.304, pruned_loss=0.07762, over 4275862.54 frames. ], batch size: 548, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:25:32,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1637142.0, ans=0.125 2023-06-26 19:26:18,365 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.963e+02 5.862e+02 9.485e+02 1.263e+03 2.647e+03, threshold=1.897e+03, percent-clipped=25.0 2023-06-26 19:26:22,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1637322.0, ans=0.125 2023-06-26 19:26:57,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1637382.0, ans=0.2 2023-06-26 19:27:10,783 INFO [train.py:996] (2/4) Epoch 9, batch 28950, loss[loss=0.2386, simple_loss=0.3438, pruned_loss=0.06671, over 21209.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3023, pruned_loss=0.07593, over 4275184.01 frames. 
], batch size: 548, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:27:36,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1637502.0, ans=0.125 2023-06-26 19:28:02,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1637562.0, ans=0.0 2023-06-26 19:28:26,399 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.80 vs. limit=15.0 2023-06-26 19:28:53,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1637682.0, ans=0.0 2023-06-26 19:29:07,345 INFO [train.py:996] (2/4) Epoch 9, batch 29000, loss[loss=0.2336, simple_loss=0.32, pruned_loss=0.07358, over 21384.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3056, pruned_loss=0.07497, over 4274020.04 frames. ], batch size: 131, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:29:22,378 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=15.0 2023-06-26 19:29:36,525 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=15.0 2023-06-26 19:29:55,923 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.44 vs. limit=6.0 2023-06-26 19:29:57,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1637862.0, ans=0.1 2023-06-26 19:29:57,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1637862.0, ans=0.2 2023-06-26 19:30:02,151 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.689e+02 5.853e+02 8.491e+02 1.284e+03 2.472e+03, threshold=1.698e+03, percent-clipped=8.0 2023-06-26 19:30:42,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1637982.0, ans=0.2 2023-06-26 19:30:46,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1637982.0, ans=0.125 2023-06-26 19:30:57,323 INFO [train.py:996] (2/4) Epoch 9, batch 29050, loss[loss=0.2002, simple_loss=0.2704, pruned_loss=0.06501, over 21693.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3056, pruned_loss=0.0746, over 4272362.26 frames. 
], batch size: 230, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:31:08,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1638042.0, ans=0.1 2023-06-26 19:31:31,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1638102.0, ans=0.125 2023-06-26 19:31:31,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1638102.0, ans=0.2 2023-06-26 19:31:56,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1638162.0, ans=0.025 2023-06-26 19:32:04,003 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.29 vs. limit=22.5 2023-06-26 19:32:24,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1638282.0, ans=0.1 2023-06-26 19:32:46,719 INFO [train.py:996] (2/4) Epoch 9, batch 29100, loss[loss=0.1974, simple_loss=0.2665, pruned_loss=0.06414, over 21755.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2981, pruned_loss=0.07327, over 4272460.16 frames. ], batch size: 112, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:33:10,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1638402.0, ans=0.125 2023-06-26 19:33:29,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1638462.0, ans=0.125 2023-06-26 19:33:31,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1638462.0, ans=0.5 2023-06-26 19:33:44,929 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.478e+02 5.318e+02 7.274e+02 9.701e+02 2.233e+03, threshold=1.455e+03, percent-clipped=4.0 2023-06-26 19:34:06,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1638522.0, ans=0.0 2023-06-26 19:34:13,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1638582.0, ans=0.04949747468305833 2023-06-26 19:34:33,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1638642.0, ans=0.05 2023-06-26 19:34:35,023 INFO [train.py:996] (2/4) Epoch 9, batch 29150, loss[loss=0.1821, simple_loss=0.2325, pruned_loss=0.06588, over 20057.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2954, pruned_loss=0.07129, over 4271947.49 frames. ], batch size: 703, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:34:39,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1638642.0, ans=0.0 2023-06-26 19:34:59,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1638702.0, ans=0.125 2023-06-26 19:36:23,267 INFO [train.py:996] (2/4) Epoch 9, batch 29200, loss[loss=0.2056, simple_loss=0.2612, pruned_loss=0.07499, over 20203.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2912, pruned_loss=0.07095, over 4264317.19 frames. 
], batch size: 703, lr: 3.20e-03, grad_scale: 32.0 2023-06-26 19:36:52,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1639002.0, ans=0.1 2023-06-26 19:37:08,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1639062.0, ans=0.1 2023-06-26 19:37:28,585 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.809e+02 5.329e+02 8.203e+02 1.175e+03 2.946e+03, threshold=1.641e+03, percent-clipped=12.0 2023-06-26 19:38:06,112 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.87 vs. limit=15.0 2023-06-26 19:38:10,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1639242.0, ans=0.1 2023-06-26 19:38:11,684 INFO [train.py:996] (2/4) Epoch 9, batch 29250, loss[loss=0.2437, simple_loss=0.3325, pruned_loss=0.07742, over 21710.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2901, pruned_loss=0.06905, over 4264326.84 frames. ], batch size: 415, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:38:40,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1639302.0, ans=0.125 2023-06-26 19:38:58,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1639362.0, ans=0.125 2023-06-26 19:39:48,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1639482.0, ans=0.0 2023-06-26 19:40:05,119 INFO [train.py:996] (2/4) Epoch 9, batch 29300, loss[loss=0.1814, simple_loss=0.2333, pruned_loss=0.06482, over 19962.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2908, pruned_loss=0.06822, over 4270518.72 frames. ], batch size: 703, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:40:13,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1639542.0, ans=15.0 2023-06-26 19:40:13,465 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-26 19:40:23,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1639602.0, ans=0.5 2023-06-26 19:40:23,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1639602.0, ans=0.125 2023-06-26 19:41:03,800 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.751e+02 5.530e+02 7.690e+02 1.193e+03 2.293e+03, threshold=1.538e+03, percent-clipped=8.0 2023-06-26 19:41:05,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1639662.0, ans=0.125 2023-06-26 19:41:55,428 INFO [train.py:996] (2/4) Epoch 9, batch 29350, loss[loss=0.1956, simple_loss=0.2661, pruned_loss=0.06253, over 21110.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.287, pruned_loss=0.06785, over 4266385.25 frames. 
], batch size: 143, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:43:17,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1640022.0, ans=0.125 2023-06-26 19:43:17,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1640022.0, ans=0.125 2023-06-26 19:43:47,616 INFO [train.py:996] (2/4) Epoch 9, batch 29400, loss[loss=0.1307, simple_loss=0.1861, pruned_loss=0.03761, over 21844.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2879, pruned_loss=0.0662, over 4267800.84 frames. ], batch size: 118, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:44:03,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1640142.0, ans=0.125 2023-06-26 19:44:06,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1640142.0, ans=0.125 2023-06-26 19:44:53,340 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.711e+02 5.689e+02 1.066e+03 1.595e+03 4.259e+03, threshold=2.132e+03, percent-clipped=27.0 2023-06-26 19:45:01,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1640322.0, ans=0.0 2023-06-26 19:45:01,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1640322.0, ans=0.04949747468305833 2023-06-26 19:45:14,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1640322.0, ans=0.125 2023-06-26 19:45:44,127 INFO [train.py:996] (2/4) Epoch 9, batch 29450, loss[loss=0.1682, simple_loss=0.2366, pruned_loss=0.04987, over 21593.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2872, pruned_loss=0.06617, over 4265940.41 frames. ], batch size: 263, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:46:14,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1640502.0, ans=0.125 2023-06-26 19:46:31,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1640562.0, ans=0.1 2023-06-26 19:46:33,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1640562.0, ans=0.1 2023-06-26 19:46:50,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1640622.0, ans=10.0 2023-06-26 19:47:24,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1640682.0, ans=0.5 2023-06-26 19:47:26,962 INFO [train.py:996] (2/4) Epoch 9, batch 29500, loss[loss=0.2065, simple_loss=0.2885, pruned_loss=0.06226, over 21981.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2909, pruned_loss=0.06848, over 4275586.79 frames. 
], batch size: 373, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:47:38,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1640742.0, ans=0.125 2023-06-26 19:48:30,385 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.856e+02 6.070e+02 8.083e+02 1.104e+03 1.958e+03, threshold=1.617e+03, percent-clipped=0.0 2023-06-26 19:49:14,768 INFO [train.py:996] (2/4) Epoch 9, batch 29550, loss[loss=0.2156, simple_loss=0.2863, pruned_loss=0.07246, over 21456.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2896, pruned_loss=0.06951, over 4280753.97 frames. ], batch size: 144, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:49:57,084 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.10 vs. limit=15.0 2023-06-26 19:51:11,589 INFO [train.py:996] (2/4) Epoch 9, batch 29600, loss[loss=0.3, simple_loss=0.4224, pruned_loss=0.08879, over 19786.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.296, pruned_loss=0.0716, over 4283423.67 frames. ], batch size: 702, lr: 3.20e-03, grad_scale: 32.0 2023-06-26 19:51:18,226 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=22.5 2023-06-26 19:51:29,485 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=22.5 2023-06-26 19:52:16,244 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.429e+02 6.278e+02 9.739e+02 1.305e+03 2.412e+03, threshold=1.948e+03, percent-clipped=12.0 2023-06-26 19:52:25,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1641522.0, ans=0.1 2023-06-26 19:52:32,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1641522.0, ans=10.0 2023-06-26 19:53:00,022 INFO [train.py:996] (2/4) Epoch 9, batch 29650, loss[loss=0.2119, simple_loss=0.2826, pruned_loss=0.07057, over 21507.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2945, pruned_loss=0.06904, over 4289058.51 frames. ], batch size: 548, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:53:32,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1641702.0, ans=0.125 2023-06-26 19:54:20,582 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.12 vs. limit=15.0 2023-06-26 19:54:41,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1641882.0, ans=0.1 2023-06-26 19:54:49,534 INFO [train.py:996] (2/4) Epoch 9, batch 29700, loss[loss=0.199, simple_loss=0.2708, pruned_loss=0.06365, over 21698.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.297, pruned_loss=0.06926, over 4293333.68 frames. ], batch size: 263, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:54:50,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1641942.0, ans=0.0 2023-06-26 19:55:00,184 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.14 vs. 
limit=15.0 2023-06-26 19:55:00,392 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.90 vs. limit=22.5 2023-06-26 19:55:36,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1642062.0, ans=0.2 2023-06-26 19:55:54,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1642062.0, ans=0.1 2023-06-26 19:55:55,276 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.514e+02 4.988e+02 7.625e+02 1.121e+03 2.201e+03, threshold=1.525e+03, percent-clipped=1.0 2023-06-26 19:55:58,467 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.04 vs. limit=15.0 2023-06-26 19:56:01,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1642122.0, ans=0.0 2023-06-26 19:56:22,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1642182.0, ans=0.125 2023-06-26 19:56:36,145 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.08 vs. limit=15.0 2023-06-26 19:56:38,118 INFO [train.py:996] (2/4) Epoch 9, batch 29750, loss[loss=0.2087, simple_loss=0.2688, pruned_loss=0.07428, over 20161.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.3003, pruned_loss=0.06878, over 4287161.14 frames. ], batch size: 702, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:57:41,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1642362.0, ans=0.1 2023-06-26 19:58:20,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1642482.0, ans=0.125 2023-06-26 19:58:26,774 INFO [train.py:996] (2/4) Epoch 9, batch 29800, loss[loss=0.2174, simple_loss=0.29, pruned_loss=0.07244, over 21431.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3019, pruned_loss=0.07, over 4291417.49 frames. ], batch size: 211, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:59:00,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1642602.0, ans=0.025 2023-06-26 19:59:17,905 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 19:59:33,431 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.208e+02 7.577e+02 1.107e+03 1.626e+03 2.906e+03, threshold=2.213e+03, percent-clipped=30.0 2023-06-26 20:00:15,161 INFO [train.py:996] (2/4) Epoch 9, batch 29850, loss[loss=0.2199, simple_loss=0.3232, pruned_loss=0.05833, over 19790.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2982, pruned_loss=0.06773, over 4295193.95 frames. 
], batch size: 703, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 20:00:27,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1642842.0, ans=0.0 2023-06-26 20:00:29,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1642842.0, ans=0.95 2023-06-26 20:01:20,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1642962.0, ans=0.0 2023-06-26 20:02:08,064 INFO [train.py:996] (2/4) Epoch 9, batch 29900, loss[loss=0.2372, simple_loss=0.3228, pruned_loss=0.07582, over 21457.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2974, pruned_loss=0.06922, over 4298106.44 frames. ], batch size: 131, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 20:03:09,780 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.950e+02 5.576e+02 8.031e+02 1.172e+03 2.675e+03, threshold=1.606e+03, percent-clipped=3.0 2023-06-26 20:03:30,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1643322.0, ans=0.125 2023-06-26 20:03:57,903 INFO [train.py:996] (2/4) Epoch 9, batch 29950, loss[loss=0.2348, simple_loss=0.3034, pruned_loss=0.0831, over 21627.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2992, pruned_loss=0.07247, over 4295085.67 frames. ], batch size: 263, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 20:03:58,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1643442.0, ans=0.0 2023-06-26 20:05:30,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1643682.0, ans=0.125 2023-06-26 20:05:54,878 INFO [train.py:996] (2/4) Epoch 9, batch 30000, loss[loss=0.217, simple_loss=0.2891, pruned_loss=0.07248, over 20831.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.301, pruned_loss=0.0727, over 4287656.60 frames. ], batch size: 611, lr: 3.20e-03, grad_scale: 32.0 2023-06-26 20:05:54,878 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-26 20:06:15,954 INFO [train.py:1028] (2/4) Epoch 9, validation: loss=0.2518, simple_loss=0.3443, pruned_loss=0.07961, over 1796401.00 frames. 2023-06-26 20:06:15,955 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-26 20:06:24,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1643742.0, ans=0.07 2023-06-26 20:06:26,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1643742.0, ans=0.2 2023-06-26 20:06:39,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1643802.0, ans=0.125 2023-06-26 20:07:04,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1643862.0, ans=0.05 2023-06-26 20:07:06,967 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.26 vs. 
limit=8.0 2023-06-26 20:07:22,155 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.541e+02 6.693e+02 9.863e+02 1.324e+03 2.517e+03, threshold=1.973e+03, percent-clipped=14.0 2023-06-26 20:07:34,213 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.26 vs. limit=15.0 2023-06-26 20:08:09,879 INFO [train.py:996] (2/4) Epoch 9, batch 30050, loss[loss=0.2866, simple_loss=0.3975, pruned_loss=0.08781, over 21649.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.3035, pruned_loss=0.06968, over 4281881.39 frames. ], batch size: 441, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 20:08:49,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1644102.0, ans=0.0 2023-06-26 20:09:40,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1644282.0, ans=0.05 2023-06-26 20:10:03,821 INFO [train.py:996] (2/4) Epoch 9, batch 30100, loss[loss=0.203, simple_loss=0.2642, pruned_loss=0.07091, over 21261.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3033, pruned_loss=0.06971, over 4276803.99 frames. ], batch size: 177, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 20:10:04,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1644342.0, ans=0.125 2023-06-26 20:10:08,340 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=15.0 2023-06-26 20:11:07,511 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.020e+02 5.633e+02 9.341e+02 1.482e+03 2.871e+03, threshold=1.868e+03, percent-clipped=12.0 2023-06-26 20:11:13,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1644522.0, ans=0.2 2023-06-26 20:11:21,366 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.15 vs. limit=15.0 2023-06-26 20:11:30,592 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.49 vs. limit=22.5 2023-06-26 20:11:45,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1644582.0, ans=0.125 2023-06-26 20:11:53,761 INFO [train.py:996] (2/4) Epoch 9, batch 30150, loss[loss=0.2457, simple_loss=0.3241, pruned_loss=0.0836, over 21228.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3001, pruned_loss=0.0706, over 4277610.03 frames. 
], batch size: 143, lr: 3.19e-03, grad_scale: 16.0 2023-06-26 20:11:58,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1644642.0, ans=0.0 2023-06-26 20:12:23,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1644702.0, ans=0.1 2023-06-26 20:12:23,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1644702.0, ans=0.1 2023-06-26 20:13:29,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1644882.0, ans=0.1 2023-06-26 20:13:50,914 INFO [train.py:996] (2/4) Epoch 9, batch 30200, loss[loss=0.2222, simple_loss=0.2953, pruned_loss=0.07455, over 20674.00 frames. ], tot_loss[loss=0.22, simple_loss=0.3005, pruned_loss=0.06977, over 4270302.70 frames. ], batch size: 607, lr: 3.19e-03, grad_scale: 16.0 2023-06-26 20:13:57,465 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 20:13:57,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1644942.0, ans=0.125 2023-06-26 20:14:36,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1645062.0, ans=0.1 2023-06-26 20:15:00,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1645122.0, ans=0.125 2023-06-26 20:15:01,611 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.601e+02 6.016e+02 8.945e+02 1.496e+03 2.296e+03, threshold=1.789e+03, percent-clipped=11.0 2023-06-26 20:15:42,569 INFO [train.py:996] (2/4) Epoch 9, batch 30250, loss[loss=0.3225, simple_loss=0.4184, pruned_loss=0.1133, over 21530.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3081, pruned_loss=0.07134, over 4276321.76 frames. ], batch size: 471, lr: 3.19e-03, grad_scale: 16.0 2023-06-26 20:16:11,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1645302.0, ans=0.125 2023-06-26 20:16:25,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1645302.0, ans=0.125 2023-06-26 20:16:29,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1645362.0, ans=0.0 2023-06-26 20:17:37,248 INFO [train.py:996] (2/4) Epoch 9, batch 30300, loss[loss=0.1816, simple_loss=0.2525, pruned_loss=0.05536, over 21394.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3063, pruned_loss=0.07218, over 4263914.30 frames. ], batch size: 211, lr: 3.19e-03, grad_scale: 16.0 2023-06-26 20:18:47,531 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.854e+02 6.241e+02 9.150e+02 1.357e+03 2.520e+03, threshold=1.830e+03, percent-clipped=12.0 2023-06-26 20:19:07,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1645782.0, ans=0.125 2023-06-26 20:19:35,136 INFO [train.py:996] (2/4) Epoch 9, batch 30350, loss[loss=0.2343, simple_loss=0.305, pruned_loss=0.08186, over 21562.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3079, pruned_loss=0.07385, over 4262735.59 frames. 
], batch size: 414, lr: 3.19e-03, grad_scale: 16.0 2023-06-26 20:20:01,351 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.18 vs. limit=5.0 2023-06-26 20:20:04,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1645902.0, ans=0.5 2023-06-26 20:20:37,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1646022.0, ans=0.0 2023-06-26 20:20:58,746 INFO [train.py:996] (2/4) Epoch 9, batch 30400, loss[loss=0.2073, simple_loss=0.2556, pruned_loss=0.07952, over 20352.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3004, pruned_loss=0.07217, over 4252541.72 frames. ], batch size: 703, lr: 3.19e-03, grad_scale: 32.0 2023-06-26 20:21:13,386 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-26 20:21:25,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1646202.0, ans=0.125 2023-06-26 20:21:54,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1646322.0, ans=0.0 2023-06-26 20:21:55,169 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.123e+02 6.385e+02 9.749e+02 1.472e+03 9.200e+03, threshold=1.950e+03, percent-clipped=15.0 2023-06-26 20:22:02,674 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.52 vs. limit=10.0 2023-06-26 20:22:03,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1646322.0, ans=0.2 2023-06-26 20:22:29,111 INFO [train.py:996] (2/4) Epoch 9, batch 30450, loss[loss=0.2537, simple_loss=0.3643, pruned_loss=0.07156, over 20020.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3006, pruned_loss=0.07175, over 4195778.12 frames. ], batch size: 702, lr: 3.19e-03, grad_scale: 16.0 2023-06-26 20:23:27,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1646622.0, ans=0.0 2023-06-26 20:23:35,702 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 20:25:55,241 INFO [train.py:996] (2/4) Epoch 10, batch 0, loss[loss=0.1886, simple_loss=0.2608, pruned_loss=0.05821, over 21776.00 frames. ], tot_loss[loss=0.1886, simple_loss=0.2608, pruned_loss=0.05821, over 21776.00 frames. ], batch size: 317, lr: 3.02e-03, grad_scale: 32.0 2023-06-26 20:25:55,241 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-26 20:26:11,826 INFO [train.py:1028] (2/4) Epoch 10, validation: loss=0.2437, simple_loss=0.3472, pruned_loss=0.0701, over 1796401.00 frames. 
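The per-batch and validation losses printed by train.py throughout this log are internally consistent: each reported loss is roughly 0.5 * simple_loss + pruned_loss (for the Epoch 10 validation entry above, 0.5 * 0.3472 + 0.0701 = 0.2437). Below is a minimal Python sketch of that consistency check, assuming only the line format visible in this log; the 0.5 weight on simple_loss is inferred from the printed values, and LOSS_RE / check_line are illustrative names, not part of the training code.

import re

# Matches the "loss=..., simple_loss=..., pruned_loss=..." fragment that
# appears in both the per-batch and the validation summary lines of this log.
LOSS_RE = re.compile(
    r"loss=(?P<loss>[\d.]+), simple_loss=(?P<simple>[\d.]+), "
    r"pruned_loss=(?P<pruned>[\d.]+)"
)

def check_line(line, simple_scale=0.5, tol=5e-4):
    """Return True if loss is approximately simple_scale * simple_loss + pruned_loss."""
    m = LOSS_RE.search(line)
    if m is None:
        return False
    loss = float(m.group("loss"))
    simple = float(m.group("simple"))
    pruned = float(m.group("pruned"))
    return abs(simple_scale * simple + pruned - loss) < tol

# Example, using the validation entry printed just above:
line = ("Epoch 10, validation: loss=0.2437, simple_loss=0.3472, "
        "pruned_loss=0.0701, over 1796401.00 frames.")
print(check_line(line))  # True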
2023-06-26 20:26:11,826 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-26 20:26:27,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1646712.0, ans=0.0 2023-06-26 20:26:42,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=1646772.0, ans=0.1 2023-06-26 20:27:01,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1646832.0, ans=0.2 2023-06-26 20:27:25,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1646892.0, ans=0.125 2023-06-26 20:27:35,020 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.973e+02 1.183e+03 2.082e+03 3.728e+03 9.226e+03, threshold=4.165e+03, percent-clipped=55.0 2023-06-26 20:27:38,014 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=15.0 2023-06-26 20:27:57,605 INFO [train.py:996] (2/4) Epoch 10, batch 50, loss[loss=0.2346, simple_loss=0.3207, pruned_loss=0.07428, over 21727.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3048, pruned_loss=0.07008, over 960573.76 frames. ], batch size: 298, lr: 3.02e-03, grad_scale: 16.0 2023-06-26 20:29:13,133 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.88 vs. limit=15.0 2023-06-26 20:29:22,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1647192.0, ans=0.1 2023-06-26 20:29:22,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1647192.0, ans=0.1 2023-06-26 20:29:23,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1647192.0, ans=0.125 2023-06-26 20:29:37,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1647252.0, ans=0.015 2023-06-26 20:29:44,279 INFO [train.py:996] (2/4) Epoch 10, batch 100, loss[loss=0.3042, simple_loss=0.372, pruned_loss=0.1182, over 21376.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.318, pruned_loss=0.07279, over 1689702.56 frames. ], batch size: 507, lr: 3.02e-03, grad_scale: 16.0 2023-06-26 20:29:46,910 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.39 vs. limit=6.0 2023-06-26 20:30:05,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1647372.0, ans=0.04949747468305833 2023-06-26 20:30:14,559 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.83 vs. limit=10.0 2023-06-26 20:31:06,600 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.835e+02 5.191e+02 6.971e+02 9.608e+02 1.975e+03, threshold=1.394e+03, percent-clipped=0.0 2023-06-26 20:31:28,455 INFO [train.py:996] (2/4) Epoch 10, batch 150, loss[loss=0.2646, simple_loss=0.3581, pruned_loss=0.08559, over 21469.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3201, pruned_loss=0.07319, over 2257526.29 frames. 
], batch size: 471, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:31:29,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1647612.0, ans=0.2 2023-06-26 20:31:53,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1647672.0, ans=0.0 2023-06-26 20:32:46,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1647792.0, ans=0.2 2023-06-26 20:32:52,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1647792.0, ans=0.0 2023-06-26 20:33:14,166 INFO [train.py:996] (2/4) Epoch 10, batch 200, loss[loss=0.2101, simple_loss=0.299, pruned_loss=0.06059, over 21642.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.317, pruned_loss=0.07125, over 2700795.48 frames. ], batch size: 263, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:33:53,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1647972.0, ans=0.125 2023-06-26 20:34:35,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1648092.0, ans=0.125 2023-06-26 20:34:39,714 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.906e+02 5.339e+02 8.333e+02 1.175e+03 2.265e+03, threshold=1.667e+03, percent-clipped=16.0 2023-06-26 20:34:57,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1648152.0, ans=0.035 2023-06-26 20:35:01,883 INFO [train.py:996] (2/4) Epoch 10, batch 250, loss[loss=0.22, simple_loss=0.2847, pruned_loss=0.07766, over 21586.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3122, pruned_loss=0.07087, over 3056692.15 frames. ], batch size: 548, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:35:48,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1648332.0, ans=0.1 2023-06-26 20:35:48,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1648332.0, ans=0.5 2023-06-26 20:36:14,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1648392.0, ans=0.0 2023-06-26 20:36:33,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1648392.0, ans=0.125 2023-06-26 20:36:54,056 INFO [train.py:996] (2/4) Epoch 10, batch 300, loss[loss=0.2256, simple_loss=0.2963, pruned_loss=0.07743, over 21375.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3083, pruned_loss=0.0717, over 3327759.69 frames. 
], batch size: 143, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:37:26,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1648572.0, ans=0.1 2023-06-26 20:38:06,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1648692.0, ans=0.125 2023-06-26 20:38:17,610 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.698e+02 5.791e+02 8.130e+02 1.304e+03 2.175e+03, threshold=1.626e+03, percent-clipped=9.0 2023-06-26 20:38:20,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1648692.0, ans=0.2 2023-06-26 20:38:23,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1648752.0, ans=0.125 2023-06-26 20:38:30,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1648752.0, ans=0.5 2023-06-26 20:38:34,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1648752.0, ans=0.0 2023-06-26 20:38:36,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1648752.0, ans=0.125 2023-06-26 20:38:40,495 INFO [train.py:996] (2/4) Epoch 10, batch 350, loss[loss=0.1936, simple_loss=0.2742, pruned_loss=0.05652, over 21569.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.3, pruned_loss=0.06959, over 3532277.16 frames. ], batch size: 212, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:39:02,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1648872.0, ans=0.2 2023-06-26 20:39:08,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1648872.0, ans=0.0 2023-06-26 20:39:32,802 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 20:39:37,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1648932.0, ans=0.125 2023-06-26 20:39:58,390 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.58 vs. limit=15.0 2023-06-26 20:40:24,647 INFO [train.py:996] (2/4) Epoch 10, batch 400, loss[loss=0.25, simple_loss=0.3268, pruned_loss=0.08661, over 21875.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2937, pruned_loss=0.06796, over 3697800.67 frames. ], batch size: 107, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 20:40:34,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1649112.0, ans=10.0 2023-06-26 20:41:47,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1649292.0, ans=0.2 2023-06-26 20:41:53,092 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.780e+02 7.996e+02 1.335e+03 1.838e+03 3.332e+03, threshold=2.670e+03, percent-clipped=35.0 2023-06-26 20:41:54,463 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.86 vs. 
limit=15.0 2023-06-26 20:41:59,897 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.10 vs. limit=15.0 2023-06-26 20:42:09,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1649352.0, ans=0.1 2023-06-26 20:42:09,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1649352.0, ans=0.0 2023-06-26 20:42:13,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1649412.0, ans=0.125 2023-06-26 20:42:14,157 INFO [train.py:996] (2/4) Epoch 10, batch 450, loss[loss=0.1848, simple_loss=0.2521, pruned_loss=0.05875, over 21632.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2909, pruned_loss=0.06661, over 3828996.64 frames. ], batch size: 298, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:42:58,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1649532.0, ans=0.125 2023-06-26 20:43:30,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1649592.0, ans=0.125 2023-06-26 20:43:43,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1649652.0, ans=0.125 2023-06-26 20:43:58,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1649712.0, ans=0.125 2023-06-26 20:43:59,418 INFO [train.py:996] (2/4) Epoch 10, batch 500, loss[loss=0.201, simple_loss=0.278, pruned_loss=0.06201, over 21609.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2913, pruned_loss=0.06599, over 3932474.76 frames. ], batch size: 391, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:44:16,399 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.87 vs. limit=10.0 2023-06-26 20:45:24,856 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.219e+02 9.001e+02 1.327e+03 2.089e+03 4.282e+03, threshold=2.653e+03, percent-clipped=10.0 2023-06-26 20:45:39,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1649952.0, ans=0.125 2023-06-26 20:45:51,435 INFO [train.py:996] (2/4) Epoch 10, batch 550, loss[loss=0.2054, simple_loss=0.2778, pruned_loss=0.06653, over 21476.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2924, pruned_loss=0.06593, over 4008150.45 frames. ], batch size: 131, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:46:02,951 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.37 vs. limit=10.0 2023-06-26 20:46:12,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1650072.0, ans=0.125 2023-06-26 20:46:16,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1650072.0, ans=0.125 2023-06-26 20:46:27,027 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.47 vs. 
limit=15.0 2023-06-26 20:47:16,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1650252.0, ans=0.125 2023-06-26 20:47:20,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1650252.0, ans=0.2 2023-06-26 20:47:33,185 INFO [train.py:996] (2/4) Epoch 10, batch 600, loss[loss=0.2126, simple_loss=0.3217, pruned_loss=0.05172, over 21731.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2955, pruned_loss=0.06591, over 4067580.20 frames. ], batch size: 298, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:48:02,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1650372.0, ans=0.125 2023-06-26 20:48:58,839 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.973e+02 6.857e+02 1.039e+03 1.439e+03 2.641e+03, threshold=2.079e+03, percent-clipped=0.0 2023-06-26 20:49:13,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1650552.0, ans=0.125 2023-06-26 20:49:14,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1650552.0, ans=0.0 2023-06-26 20:49:19,463 INFO [train.py:996] (2/4) Epoch 10, batch 650, loss[loss=0.2413, simple_loss=0.3085, pruned_loss=0.0871, over 15127.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.298, pruned_loss=0.06709, over 4104731.29 frames. ], batch size: 61, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:49:39,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1650612.0, ans=0.2 2023-06-26 20:50:40,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1650792.0, ans=0.125 2023-06-26 20:50:46,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1650852.0, ans=0.2 2023-06-26 20:51:00,890 INFO [train.py:996] (2/4) Epoch 10, batch 700, loss[loss=0.1935, simple_loss=0.2618, pruned_loss=0.06263, over 21328.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3044, pruned_loss=0.06957, over 4147299.58 frames. ], batch size: 144, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:52:02,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1651032.0, ans=0.0 2023-06-26 20:52:26,601 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.100e+02 6.227e+02 9.890e+02 1.482e+03 2.866e+03, threshold=1.978e+03, percent-clipped=9.0 2023-06-26 20:52:30,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1651152.0, ans=0.125 2023-06-26 20:52:30,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1651152.0, ans=0.2 2023-06-26 20:52:39,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1651152.0, ans=0.0 2023-06-26 20:52:47,489 INFO [train.py:996] (2/4) Epoch 10, batch 750, loss[loss=0.2107, simple_loss=0.269, pruned_loss=0.07616, over 21480.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3021, pruned_loss=0.06946, over 4178868.68 frames. 
], batch size: 548, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:53:14,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1651272.0, ans=0.125 2023-06-26 20:53:36,766 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2023-06-26 20:53:40,339 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-06-26 20:53:43,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1651332.0, ans=0.1 2023-06-26 20:54:23,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1651452.0, ans=0.125 2023-06-26 20:54:35,031 INFO [train.py:996] (2/4) Epoch 10, batch 800, loss[loss=0.2269, simple_loss=0.2944, pruned_loss=0.07969, over 21699.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2976, pruned_loss=0.069, over 4207699.20 frames. ], batch size: 441, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 20:54:57,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1651572.0, ans=0.0 2023-06-26 20:55:04,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1651572.0, ans=0.2 2023-06-26 20:55:09,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1651572.0, ans=0.0 2023-06-26 20:56:04,639 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.663e+02 5.824e+02 9.070e+02 1.319e+03 2.505e+03, threshold=1.814e+03, percent-clipped=4.0 2023-06-26 20:56:18,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1651752.0, ans=0.1 2023-06-26 20:56:23,622 INFO [train.py:996] (2/4) Epoch 10, batch 850, loss[loss=0.1743, simple_loss=0.2461, pruned_loss=0.05122, over 21723.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.294, pruned_loss=0.0688, over 4230160.88 frames. ], batch size: 247, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:57:48,722 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.70 vs. limit=12.0 2023-06-26 20:57:58,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1652052.0, ans=0.0 2023-06-26 20:58:18,447 INFO [train.py:996] (2/4) Epoch 10, batch 900, loss[loss=0.1948, simple_loss=0.27, pruned_loss=0.05977, over 21158.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2915, pruned_loss=0.06875, over 4246809.31 frames. ], batch size: 143, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:58:22,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1652112.0, ans=0.125 2023-06-26 20:59:22,231 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=22.5 2023-06-26 20:59:29,337 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.75 vs. 
limit=22.5 2023-06-26 20:59:42,414 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.674e+02 4.955e+02 6.528e+02 1.022e+03 3.124e+03, threshold=1.306e+03, percent-clipped=4.0 2023-06-26 20:59:52,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1652352.0, ans=0.2 2023-06-26 21:00:07,589 INFO [train.py:996] (2/4) Epoch 10, batch 950, loss[loss=0.2218, simple_loss=0.2914, pruned_loss=0.07607, over 21909.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2903, pruned_loss=0.0688, over 4252140.02 frames. ], batch size: 414, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:00:16,223 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.40 vs. limit=12.0 2023-06-26 21:01:01,782 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 21:01:18,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1652592.0, ans=0.125 2023-06-26 21:01:50,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1652652.0, ans=0.125 2023-06-26 21:01:54,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=1652652.0, ans=0.1 2023-06-26 21:01:56,955 INFO [train.py:996] (2/4) Epoch 10, batch 1000, loss[loss=0.2114, simple_loss=0.3008, pruned_loss=0.06106, over 21402.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2916, pruned_loss=0.06879, over 4264728.61 frames. ], batch size: 548, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:01:59,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1652712.0, ans=0.125 2023-06-26 21:02:08,693 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=22.5 2023-06-26 21:03:31,701 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.126e+02 7.237e+02 1.217e+03 1.852e+03 3.276e+03, threshold=2.433e+03, percent-clipped=47.0 2023-06-26 21:03:56,356 INFO [train.py:996] (2/4) Epoch 10, batch 1050, loss[loss=0.2078, simple_loss=0.3001, pruned_loss=0.05779, over 21623.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2908, pruned_loss=0.06821, over 4273098.34 frames. ], batch size: 441, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:05:20,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1653192.0, ans=0.125 2023-06-26 21:05:33,910 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.74 vs. limit=22.5 2023-06-26 21:05:42,781 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=15.0 2023-06-26 21:05:46,782 INFO [train.py:996] (2/4) Epoch 10, batch 1100, loss[loss=0.2254, simple_loss=0.3249, pruned_loss=0.063, over 21844.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2894, pruned_loss=0.06702, over 4278586.38 frames. 
], batch size: 371, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:05:47,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1653312.0, ans=0.125 2023-06-26 21:06:23,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1653372.0, ans=0.1 2023-06-26 21:06:30,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1653372.0, ans=0.125 2023-06-26 21:07:01,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1653492.0, ans=0.125 2023-06-26 21:07:14,103 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.610e+02 5.858e+02 8.624e+02 1.218e+03 2.996e+03, threshold=1.725e+03, percent-clipped=2.0 2023-06-26 21:07:32,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1653552.0, ans=0.125 2023-06-26 21:07:38,291 INFO [train.py:996] (2/4) Epoch 10, batch 1150, loss[loss=0.264, simple_loss=0.3159, pruned_loss=0.1061, over 21683.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.289, pruned_loss=0.0671, over 4279794.04 frames. ], batch size: 507, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:07:58,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1653612.0, ans=0.2 2023-06-26 21:08:32,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1653732.0, ans=0.125 2023-06-26 21:08:40,733 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.24 vs. limit=15.0 2023-06-26 21:08:43,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1653732.0, ans=0.0 2023-06-26 21:08:47,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1653732.0, ans=0.0 2023-06-26 21:09:23,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1653852.0, ans=0.09899494936611666 2023-06-26 21:09:36,643 INFO [train.py:996] (2/4) Epoch 10, batch 1200, loss[loss=0.1935, simple_loss=0.2799, pruned_loss=0.05353, over 21361.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2906, pruned_loss=0.06775, over 4279601.89 frames. ], batch size: 194, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 21:09:42,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1653912.0, ans=0.125 2023-06-26 21:10:43,359 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 21:10:53,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1654092.0, ans=0.125 2023-06-26 21:11:00,211 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.823e+02 5.719e+02 8.661e+02 1.239e+03 3.080e+03, threshold=1.732e+03, percent-clipped=10.0 2023-06-26 21:11:25,996 INFO [train.py:996] (2/4) Epoch 10, batch 1250, loss[loss=0.2021, simple_loss=0.2842, pruned_loss=0.06002, over 21842.00 frames. 
], tot_loss[loss=0.2147, simple_loss=0.2931, pruned_loss=0.06814, over 4282658.33 frames. ], batch size: 298, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 21:11:48,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1654272.0, ans=0.125 2023-06-26 21:12:19,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1654332.0, ans=0.125 2023-06-26 21:12:56,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1654452.0, ans=0.125 2023-06-26 21:13:16,672 INFO [train.py:996] (2/4) Epoch 10, batch 1300, loss[loss=0.202, simple_loss=0.3037, pruned_loss=0.05018, over 21655.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.295, pruned_loss=0.06797, over 4290569.43 frames. ], batch size: 263, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 21:13:23,542 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=22.5 2023-06-26 21:14:43,546 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.194e+02 7.398e+02 1.015e+03 1.513e+03 3.841e+03, threshold=2.029e+03, percent-clipped=13.0 2023-06-26 21:14:45,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=1654752.0, ans=0.1 2023-06-26 21:15:06,108 INFO [train.py:996] (2/4) Epoch 10, batch 1350, loss[loss=0.2116, simple_loss=0.2831, pruned_loss=0.07004, over 21836.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.297, pruned_loss=0.06901, over 4288761.49 frames. ], batch size: 371, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:15:27,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1654872.0, ans=0.0 2023-06-26 21:15:29,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1654872.0, ans=0.0 2023-06-26 21:16:50,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1655052.0, ans=0.125 2023-06-26 21:16:52,683 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.36 vs. limit=15.0 2023-06-26 21:17:00,108 INFO [train.py:996] (2/4) Epoch 10, batch 1400, loss[loss=0.2402, simple_loss=0.3121, pruned_loss=0.08414, over 21380.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2964, pruned_loss=0.06896, over 4288047.67 frames. 
], batch size: 548, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:17:16,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1655172.0, ans=0.125 2023-06-26 21:17:33,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1655172.0, ans=0.125 2023-06-26 21:18:12,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1655292.0, ans=0.125 2023-06-26 21:18:25,054 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.941e+02 5.863e+02 9.944e+02 1.473e+03 3.016e+03, threshold=1.989e+03, percent-clipped=13.0 2023-06-26 21:18:27,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1655352.0, ans=0.125 2023-06-26 21:18:48,226 INFO [train.py:996] (2/4) Epoch 10, batch 1450, loss[loss=0.2654, simple_loss=0.3392, pruned_loss=0.09581, over 21332.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2975, pruned_loss=0.06988, over 4282046.30 frames. ], batch size: 549, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:18:49,560 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.11 vs. limit=15.0 2023-06-26 21:19:36,409 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. limit=6.0 2023-06-26 21:19:44,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1655592.0, ans=0.2 2023-06-26 21:20:03,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1655592.0, ans=0.125 2023-06-26 21:20:05,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1655592.0, ans=0.125 2023-06-26 21:20:36,889 INFO [train.py:996] (2/4) Epoch 10, batch 1500, loss[loss=0.2439, simple_loss=0.3106, pruned_loss=0.08857, over 21365.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2983, pruned_loss=0.06993, over 4279958.52 frames. ], batch size: 159, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:20:41,390 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.56 vs. limit=15.0 2023-06-26 21:21:28,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1655832.0, ans=0.125 2023-06-26 21:21:59,430 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.64 vs. limit=15.0 2023-06-26 21:22:03,790 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.679e+02 5.579e+02 7.007e+02 1.027e+03 2.656e+03, threshold=1.401e+03, percent-clipped=4.0 2023-06-26 21:22:23,449 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 21:22:29,806 INFO [train.py:996] (2/4) Epoch 10, batch 1550, loss[loss=0.1673, simple_loss=0.2518, pruned_loss=0.0414, over 21358.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2952, pruned_loss=0.06887, over 4275646.35 frames. 
], batch size: 131, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:22:36,273 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-06-26 21:23:12,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1656132.0, ans=0.125 2023-06-26 21:23:22,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1656132.0, ans=0.04949747468305833 2023-06-26 21:24:12,659 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.39 vs. limit=15.0 2023-06-26 21:24:18,748 INFO [train.py:996] (2/4) Epoch 10, batch 1600, loss[loss=0.226, simple_loss=0.2975, pruned_loss=0.07729, over 21809.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2919, pruned_loss=0.068, over 4272585.26 frames. ], batch size: 351, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 21:24:37,538 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=22.5 2023-06-26 21:25:50,460 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.007e+02 6.112e+02 1.058e+03 1.502e+03 3.121e+03, threshold=2.116e+03, percent-clipped=30.0 2023-06-26 21:26:07,889 INFO [train.py:996] (2/4) Epoch 10, batch 1650, loss[loss=0.235, simple_loss=0.3058, pruned_loss=0.08214, over 21769.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2908, pruned_loss=0.0678, over 4277078.77 frames. ], batch size: 389, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 21:26:24,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1656612.0, ans=0.1 2023-06-26 21:27:03,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1656732.0, ans=0.0 2023-06-26 21:27:07,212 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.44 vs. limit=15.0 2023-06-26 21:27:20,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1656792.0, ans=0.0 2023-06-26 21:27:30,130 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.30 vs. limit=15.0 2023-06-26 21:27:51,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1656852.0, ans=0.125 2023-06-26 21:28:01,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1656852.0, ans=0.0 2023-06-26 21:28:03,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1656912.0, ans=0.0 2023-06-26 21:28:04,345 INFO [train.py:996] (2/4) Epoch 10, batch 1700, loss[loss=0.1948, simple_loss=0.3035, pruned_loss=0.04303, over 21012.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2919, pruned_loss=0.06758, over 4282494.66 frames. 
], batch size: 607, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:28:42,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1656972.0, ans=0.0 2023-06-26 21:29:04,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1657032.0, ans=0.125 2023-06-26 21:29:40,236 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.858e+02 6.519e+02 9.043e+02 1.348e+03 2.914e+03, threshold=1.809e+03, percent-clipped=3.0 2023-06-26 21:29:56,230 INFO [train.py:996] (2/4) Epoch 10, batch 1750, loss[loss=0.26, simple_loss=0.3333, pruned_loss=0.0933, over 21789.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2935, pruned_loss=0.06761, over 4279838.13 frames. ], batch size: 441, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:30:02,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1657212.0, ans=0.05 2023-06-26 21:30:45,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1657332.0, ans=0.0 2023-06-26 21:30:56,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1657332.0, ans=0.125 2023-06-26 21:31:54,540 INFO [train.py:996] (2/4) Epoch 10, batch 1800, loss[loss=0.1878, simple_loss=0.2613, pruned_loss=0.05718, over 21469.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2908, pruned_loss=0.06495, over 4284720.78 frames. ], batch size: 195, lr: 3.01e-03, grad_scale: 8.0 2023-06-26 21:31:55,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1657512.0, ans=0.2 2023-06-26 21:31:59,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1657512.0, ans=0.0 2023-06-26 21:32:07,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1657512.0, ans=0.125 2023-06-26 21:33:24,612 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.06 vs. limit=15.0 2023-06-26 21:33:24,912 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.986e+02 5.658e+02 9.190e+02 1.767e+03 4.020e+03, threshold=1.838e+03, percent-clipped=23.0 2023-06-26 21:33:44,342 INFO [train.py:996] (2/4) Epoch 10, batch 1850, loss[loss=0.2319, simple_loss=0.2991, pruned_loss=0.08234, over 21887.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2925, pruned_loss=0.0639, over 4284281.33 frames. 
], batch size: 124, lr: 3.01e-03, grad_scale: 8.0 2023-06-26 21:34:10,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1657872.0, ans=0.1 2023-06-26 21:34:18,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1657872.0, ans=0.1 2023-06-26 21:34:49,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1657932.0, ans=10.0 2023-06-26 21:35:12,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1658052.0, ans=10.0 2023-06-26 21:35:32,244 INFO [train.py:996] (2/4) Epoch 10, batch 1900, loss[loss=0.2222, simple_loss=0.296, pruned_loss=0.07417, over 21600.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2954, pruned_loss=0.06416, over 4284693.48 frames. ], batch size: 212, lr: 3.01e-03, grad_scale: 8.0 2023-06-26 21:35:32,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1658112.0, ans=0.125 2023-06-26 21:35:32,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1658112.0, ans=0.125 2023-06-26 21:35:50,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1658112.0, ans=0.125 2023-06-26 21:35:51,193 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=22.5 2023-06-26 21:36:45,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1658292.0, ans=0.125 2023-06-26 21:36:59,940 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 21:37:08,220 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.968e+02 6.601e+02 8.691e+02 1.330e+03 2.480e+03, threshold=1.738e+03, percent-clipped=9.0 2023-06-26 21:37:22,024 INFO [train.py:996] (2/4) Epoch 10, batch 1950, loss[loss=0.1809, simple_loss=0.244, pruned_loss=0.05884, over 21633.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2924, pruned_loss=0.06477, over 4286117.44 frames. ], batch size: 282, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 21:38:03,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1658472.0, ans=0.125 2023-06-26 21:38:28,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1658532.0, ans=0.125 2023-06-26 21:38:28,707 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 21:38:42,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1658592.0, ans=0.125 2023-06-26 21:39:11,017 INFO [train.py:996] (2/4) Epoch 10, batch 2000, loss[loss=0.1941, simple_loss=0.2597, pruned_loss=0.06429, over 21730.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2898, pruned_loss=0.06392, over 4285203.49 frames. 
], batch size: 351, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:39:32,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1658772.0, ans=0.5 2023-06-26 21:39:50,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1658772.0, ans=0.125 2023-06-26 21:40:08,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1658832.0, ans=0.125 2023-06-26 21:40:13,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1658832.0, ans=0.1 2023-06-26 21:40:46,016 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.256e+02 7.434e+02 1.051e+03 1.825e+03 4.116e+03, threshold=2.102e+03, percent-clipped=26.0 2023-06-26 21:41:00,348 INFO [train.py:996] (2/4) Epoch 10, batch 2050, loss[loss=0.2082, simple_loss=0.2849, pruned_loss=0.06578, over 21885.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2914, pruned_loss=0.06485, over 4291337.05 frames. ], batch size: 332, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:41:13,120 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 21:41:30,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1659072.0, ans=0.0 2023-06-26 21:41:53,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1659132.0, ans=0.125 2023-06-26 21:41:53,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1659132.0, ans=0.125 2023-06-26 21:42:02,691 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-26 21:42:05,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1659132.0, ans=0.125 2023-06-26 21:42:30,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1659252.0, ans=0.125 2023-06-26 21:42:36,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1659252.0, ans=0.1 2023-06-26 21:42:40,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1659252.0, ans=0.2 2023-06-26 21:42:53,052 INFO [train.py:996] (2/4) Epoch 10, batch 2100, loss[loss=0.2374, simple_loss=0.3107, pruned_loss=0.08202, over 21212.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2953, pruned_loss=0.0664, over 4297305.60 frames. 
], batch size: 159, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:42:56,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1659312.0, ans=0.0 2023-06-26 21:43:09,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1659312.0, ans=0.125 2023-06-26 21:43:52,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1659432.0, ans=0.125 2023-06-26 21:43:52,928 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.37 vs. limit=15.0 2023-06-26 21:44:22,017 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.975e+02 6.453e+02 1.021e+03 1.329e+03 2.280e+03, threshold=2.042e+03, percent-clipped=5.0 2023-06-26 21:44:41,181 INFO [train.py:996] (2/4) Epoch 10, batch 2150, loss[loss=0.215, simple_loss=0.297, pruned_loss=0.06653, over 21691.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2928, pruned_loss=0.0664, over 4298418.36 frames. ], batch size: 391, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:45:00,514 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.79 vs. limit=15.0 2023-06-26 21:45:03,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1659672.0, ans=0.0 2023-06-26 21:45:48,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1659792.0, ans=0.125 2023-06-26 21:46:29,854 INFO [train.py:996] (2/4) Epoch 10, batch 2200, loss[loss=0.2078, simple_loss=0.2921, pruned_loss=0.06173, over 21328.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2941, pruned_loss=0.0674, over 4296654.03 frames. ], batch size: 131, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 21:46:41,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1659912.0, ans=0.0 2023-06-26 21:46:41,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1659912.0, ans=0.0 2023-06-26 21:47:27,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1660032.0, ans=0.125 2023-06-26 21:47:27,794 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.89 vs. limit=6.0 2023-06-26 21:47:30,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1660032.0, ans=0.0 2023-06-26 21:47:40,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1660092.0, ans=0.05 2023-06-26 21:48:00,549 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.021e+02 5.718e+02 8.930e+02 1.284e+03 2.710e+03, threshold=1.786e+03, percent-clipped=5.0 2023-06-26 21:48:13,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1660152.0, ans=0.04949747468305833 2023-06-26 21:48:17,709 INFO [train.py:996] (2/4) Epoch 10, batch 2250, loss[loss=0.1498, simple_loss=0.2291, pruned_loss=0.03524, over 21410.00 frames. 
], tot_loss[loss=0.2117, simple_loss=0.2908, pruned_loss=0.06635, over 4288090.17 frames. ], batch size: 131, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 21:48:50,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1660272.0, ans=0.1 2023-06-26 21:49:02,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1660332.0, ans=0.0 2023-06-26 21:49:28,708 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.50 vs. limit=22.5 2023-06-26 21:49:33,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1660392.0, ans=0.2 2023-06-26 21:49:34,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1660392.0, ans=0.07 2023-06-26 21:49:41,364 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 21:50:00,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1660452.0, ans=10.0 2023-06-26 21:50:04,951 INFO [train.py:996] (2/4) Epoch 10, batch 2300, loss[loss=0.1846, simple_loss=0.2467, pruned_loss=0.06128, over 21492.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2874, pruned_loss=0.06588, over 4282509.53 frames. ], batch size: 195, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 21:50:21,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1660512.0, ans=0.1 2023-06-26 21:50:49,360 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0 2023-06-26 21:51:19,236 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.13 vs. limit=15.0 2023-06-26 21:51:40,842 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.466e+02 6.340e+02 1.061e+03 1.425e+03 3.450e+03, threshold=2.122e+03, percent-clipped=15.0 2023-06-26 21:51:52,978 INFO [train.py:996] (2/4) Epoch 10, batch 2350, loss[loss=0.2147, simple_loss=0.2859, pruned_loss=0.07176, over 21190.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2852, pruned_loss=0.06628, over 4282403.26 frames. ], batch size: 159, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 21:52:04,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1660812.0, ans=0.0 2023-06-26 21:52:09,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1660812.0, ans=0.125 2023-06-26 21:52:36,302 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 21:53:12,785 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.71 vs. limit=12.0 2023-06-26 21:53:23,071 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.73 vs. 
limit=15.0 2023-06-26 21:53:27,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1661052.0, ans=0.2 2023-06-26 21:53:46,737 INFO [train.py:996] (2/4) Epoch 10, batch 2400, loss[loss=0.2386, simple_loss=0.3104, pruned_loss=0.08343, over 21332.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2874, pruned_loss=0.06758, over 4278263.83 frames. ], batch size: 159, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:54:16,320 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-26 21:54:23,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1661172.0, ans=0.0 2023-06-26 21:54:36,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1661232.0, ans=0.125 2023-06-26 21:54:47,487 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=22.5 2023-06-26 21:55:17,527 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.139e+02 8.857e+02 1.254e+03 1.714e+03 3.828e+03, threshold=2.507e+03, percent-clipped=13.0 2023-06-26 21:55:19,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1661352.0, ans=0.125 2023-06-26 21:55:21,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1661352.0, ans=0.2 2023-06-26 21:55:35,118 INFO [train.py:996] (2/4) Epoch 10, batch 2450, loss[loss=0.2231, simple_loss=0.296, pruned_loss=0.07516, over 21618.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2911, pruned_loss=0.06957, over 4280536.10 frames. ], batch size: 441, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:55:48,264 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 21:56:08,004 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=15.0 2023-06-26 21:56:22,973 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.08 vs. limit=22.5 2023-06-26 21:56:38,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1661592.0, ans=0.1 2023-06-26 21:57:01,617 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-26 21:57:04,792 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.16 vs. limit=15.0 2023-06-26 21:57:22,929 INFO [train.py:996] (2/4) Epoch 10, batch 2500, loss[loss=0.2253, simple_loss=0.298, pruned_loss=0.07631, over 21530.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2911, pruned_loss=0.0693, over 4273983.39 frames. 
], batch size: 441, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:58:26,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1661892.0, ans=0.0 2023-06-26 21:58:49,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1661952.0, ans=0.2 2023-06-26 21:58:52,895 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.949e+02 5.442e+02 7.727e+02 1.360e+03 2.872e+03, threshold=1.545e+03, percent-clipped=3.0 2023-06-26 21:59:16,971 INFO [train.py:996] (2/4) Epoch 10, batch 2550, loss[loss=0.2573, simple_loss=0.3379, pruned_loss=0.0884, over 21783.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2913, pruned_loss=0.0688, over 4271839.04 frames. ], batch size: 118, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:59:17,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1662012.0, ans=0.125 2023-06-26 21:59:52,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1662072.0, ans=0.05 2023-06-26 21:59:55,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1662132.0, ans=0.1 2023-06-26 22:00:09,867 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.87 vs. limit=12.0 2023-06-26 22:00:50,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1662252.0, ans=0.125 2023-06-26 22:00:58,656 INFO [train.py:996] (2/4) Epoch 10, batch 2600, loss[loss=0.1743, simple_loss=0.238, pruned_loss=0.0553, over 21399.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2897, pruned_loss=0.06921, over 4276733.05 frames. ], batch size: 211, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:00:59,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1662312.0, ans=0.1 2023-06-26 22:01:16,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1662312.0, ans=0.0 2023-06-26 22:01:42,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1662372.0, ans=0.2 2023-06-26 22:01:50,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1662432.0, ans=0.0 2023-06-26 22:02:27,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1662552.0, ans=0.2 2023-06-26 22:02:30,508 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.653e+02 5.932e+02 7.910e+02 1.183e+03 2.273e+03, threshold=1.582e+03, percent-clipped=10.0 2023-06-26 22:02:48,669 INFO [train.py:996] (2/4) Epoch 10, batch 2650, loss[loss=0.1817, simple_loss=0.2559, pruned_loss=0.05373, over 21790.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2911, pruned_loss=0.06947, over 4275971.41 frames. 
], batch size: 247, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:03:05,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1662612.0, ans=0.2 2023-06-26 22:04:09,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1662792.0, ans=0.2 2023-06-26 22:04:15,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1662852.0, ans=0.2 2023-06-26 22:04:43,344 INFO [train.py:996] (2/4) Epoch 10, batch 2700, loss[loss=0.1825, simple_loss=0.2571, pruned_loss=0.05396, over 21632.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2914, pruned_loss=0.0704, over 4268939.05 frames. ], batch size: 230, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:05:08,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1662972.0, ans=0.1 2023-06-26 22:05:27,031 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-26 22:06:00,541 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-26 22:06:09,118 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.093e+02 5.804e+02 8.533e+02 1.371e+03 2.390e+03, threshold=1.707e+03, percent-clipped=16.0 2023-06-26 22:06:19,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1663152.0, ans=0.0 2023-06-26 22:06:31,061 INFO [train.py:996] (2/4) Epoch 10, batch 2750, loss[loss=0.2383, simple_loss=0.3342, pruned_loss=0.07123, over 20883.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2918, pruned_loss=0.07003, over 4267942.77 frames. ], batch size: 607, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 22:06:49,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1663272.0, ans=0.04949747468305833 2023-06-26 22:07:40,900 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=12.0 2023-06-26 22:08:15,774 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.11 vs. limit=10.0 2023-06-26 22:08:21,192 INFO [train.py:996] (2/4) Epoch 10, batch 2800, loss[loss=0.264, simple_loss=0.3556, pruned_loss=0.0862, over 21668.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2928, pruned_loss=0.07012, over 4267730.04 frames. ], batch size: 389, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:08:55,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1663572.0, ans=0.125 2023-06-26 22:08:56,359 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=15.0 2023-06-26 22:09:24,895 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.88 vs. 
limit=10.0 2023-06-26 22:09:27,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1663692.0, ans=0.0 2023-06-26 22:09:59,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1663752.0, ans=0.1 2023-06-26 22:10:00,641 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.922e+02 7.435e+02 1.264e+03 2.282e+03 6.620e+03, threshold=2.529e+03, percent-clipped=31.0 2023-06-26 22:10:11,266 INFO [train.py:996] (2/4) Epoch 10, batch 2850, loss[loss=0.1717, simple_loss=0.2431, pruned_loss=0.05013, over 21582.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2946, pruned_loss=0.0707, over 4262327.62 frames. ], batch size: 230, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:10:19,492 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=12.0 2023-06-26 22:10:39,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1663872.0, ans=0.125 2023-06-26 22:11:08,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1663932.0, ans=0.125 2023-06-26 22:11:45,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1664052.0, ans=0.125 2023-06-26 22:11:59,224 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.98 vs. limit=15.0 2023-06-26 22:11:59,718 INFO [train.py:996] (2/4) Epoch 10, batch 2900, loss[loss=0.2064, simple_loss=0.2769, pruned_loss=0.06794, over 21937.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2926, pruned_loss=0.07067, over 4265853.34 frames. ], batch size: 316, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:12:12,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1664112.0, ans=0.0 2023-06-26 22:12:12,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1664112.0, ans=0.0 2023-06-26 22:12:30,319 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=15.0 2023-06-26 22:13:05,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1664292.0, ans=0.125 2023-06-26 22:13:38,077 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.904e+02 5.286e+02 7.202e+02 1.145e+03 2.929e+03, threshold=1.440e+03, percent-clipped=1.0 2023-06-26 22:13:46,804 INFO [train.py:996] (2/4) Epoch 10, batch 2950, loss[loss=0.2033, simple_loss=0.2762, pruned_loss=0.06521, over 21865.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2928, pruned_loss=0.07033, over 4274245.46 frames. ], batch size: 298, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 22:14:31,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1664532.0, ans=0.125 2023-06-26 22:15:40,778 INFO [train.py:996] (2/4) Epoch 10, batch 3000, loss[loss=0.221, simple_loss=0.3162, pruned_loss=0.06292, over 19889.00 frames. 
], tot_loss[loss=0.2196, simple_loss=0.2967, pruned_loss=0.07121, over 4277395.72 frames. ], batch size: 703, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 22:15:40,779 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-26 22:15:58,656 INFO [train.py:1028] (2/4) Epoch 10, validation: loss=0.2517, simple_loss=0.3411, pruned_loss=0.08118, over 1796401.00 frames. 2023-06-26 22:15:58,657 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-26 22:16:04,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1664712.0, ans=0.125 2023-06-26 22:16:10,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1664712.0, ans=0.125 2023-06-26 22:17:21,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1664892.0, ans=0.0 2023-06-26 22:17:39,713 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.020e+02 5.823e+02 1.007e+03 1.425e+03 2.943e+03, threshold=2.014e+03, percent-clipped=25.0 2023-06-26 22:17:48,252 INFO [train.py:996] (2/4) Epoch 10, batch 3050, loss[loss=0.1999, simple_loss=0.2992, pruned_loss=0.05028, over 21678.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2981, pruned_loss=0.06986, over 4273293.62 frames. ], batch size: 441, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 22:19:02,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1665192.0, ans=0.125 2023-06-26 22:19:02,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1665192.0, ans=0.0 2023-06-26 22:19:37,772 INFO [train.py:996] (2/4) Epoch 10, batch 3100, loss[loss=0.2093, simple_loss=0.2947, pruned_loss=0.06191, over 21297.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2974, pruned_loss=0.06906, over 4276380.19 frames. ], batch size: 548, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 22:21:17,183 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.635e+02 5.384e+02 7.508e+02 1.175e+03 3.644e+03, threshold=1.502e+03, percent-clipped=4.0 2023-06-26 22:21:26,442 INFO [train.py:996] (2/4) Epoch 10, batch 3150, loss[loss=0.2252, simple_loss=0.3022, pruned_loss=0.07412, over 20719.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2996, pruned_loss=0.06983, over 4277740.75 frames. ], batch size: 607, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 22:21:27,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1665612.0, ans=0.1 2023-06-26 22:23:17,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1665852.0, ans=0.125 2023-06-26 22:23:22,069 INFO [train.py:996] (2/4) Epoch 10, batch 3200, loss[loss=0.2368, simple_loss=0.3272, pruned_loss=0.07317, over 21711.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.3011, pruned_loss=0.07065, over 4282630.76 frames. 
], batch size: 441, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:23:32,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1665912.0, ans=0.0 2023-06-26 22:24:23,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1666032.0, ans=0.07 2023-06-26 22:24:32,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1666092.0, ans=0.125 2023-06-26 22:24:42,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1666092.0, ans=0.1 2023-06-26 22:24:57,503 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.38 vs. limit=15.0 2023-06-26 22:25:01,076 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.433e+02 6.467e+02 1.041e+03 1.408e+03 2.668e+03, threshold=2.081e+03, percent-clipped=19.0 2023-06-26 22:25:14,950 INFO [train.py:996] (2/4) Epoch 10, batch 3250, loss[loss=0.1838, simple_loss=0.2531, pruned_loss=0.05727, over 21620.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.3016, pruned_loss=0.0714, over 4281039.40 frames. ], batch size: 231, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:25:17,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1666212.0, ans=0.125 2023-06-26 22:26:43,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1666452.0, ans=0.125 2023-06-26 22:26:43,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1666452.0, ans=0.125 2023-06-26 22:27:04,016 INFO [train.py:996] (2/4) Epoch 10, batch 3300, loss[loss=0.2029, simple_loss=0.3012, pruned_loss=0.05227, over 21839.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2973, pruned_loss=0.07008, over 4284045.89 frames. ], batch size: 317, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:27:18,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1666512.0, ans=0.2 2023-06-26 22:27:43,747 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.78 vs. limit=22.5 2023-06-26 22:28:42,872 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.261e+02 7.426e+02 1.088e+03 1.707e+03 4.708e+03, threshold=2.176e+03, percent-clipped=17.0 2023-06-26 22:28:51,852 INFO [train.py:996] (2/4) Epoch 10, batch 3350, loss[loss=0.2419, simple_loss=0.3105, pruned_loss=0.08666, over 21346.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2994, pruned_loss=0.07062, over 4281560.79 frames. ], batch size: 159, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:29:53,041 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.84 vs. 
limit=22.5 2023-06-26 22:30:02,282 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 22:30:12,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1666992.0, ans=0.0 2023-06-26 22:30:39,119 INFO [train.py:996] (2/4) Epoch 10, batch 3400, loss[loss=0.2121, simple_loss=0.2709, pruned_loss=0.07667, over 20010.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2996, pruned_loss=0.07123, over 4284410.41 frames. ], batch size: 704, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:31:20,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1667172.0, ans=0.125 2023-06-26 22:31:25,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1667232.0, ans=0.125 2023-06-26 22:32:20,112 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.256e+02 6.513e+02 9.750e+02 1.536e+03 3.496e+03, threshold=1.950e+03, percent-clipped=9.0 2023-06-26 22:32:34,443 INFO [train.py:996] (2/4) Epoch 10, batch 3450, loss[loss=0.2059, simple_loss=0.2691, pruned_loss=0.07138, over 21312.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2948, pruned_loss=0.07064, over 4291174.85 frames. ], batch size: 131, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:32:37,577 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.00 vs. limit=12.0 2023-06-26 22:33:01,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1667472.0, ans=0.125 2023-06-26 22:33:08,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1667472.0, ans=0.125 2023-06-26 22:33:10,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1667472.0, ans=0.0 2023-06-26 22:34:24,152 INFO [train.py:996] (2/4) Epoch 10, batch 3500, loss[loss=0.2094, simple_loss=0.2838, pruned_loss=0.06749, over 21233.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3014, pruned_loss=0.07382, over 4287292.80 frames. ], batch size: 608, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:35:01,731 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=22.5 2023-06-26 22:36:04,391 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.012e+02 7.162e+02 1.009e+03 1.814e+03 3.226e+03, threshold=2.018e+03, percent-clipped=21.0 2023-06-26 22:36:13,130 INFO [train.py:996] (2/4) Epoch 10, batch 3550, loss[loss=0.2236, simple_loss=0.3036, pruned_loss=0.07179, over 21394.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.305, pruned_loss=0.0749, over 4284906.50 frames. 
], batch size: 131, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:36:22,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1668012.0, ans=0.2 2023-06-26 22:36:40,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1668072.0, ans=10.0 2023-06-26 22:37:15,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1668132.0, ans=0.125 2023-06-26 22:37:19,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1668192.0, ans=0.125 2023-06-26 22:37:25,092 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=15.0 2023-06-26 22:37:45,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1668252.0, ans=0.2 2023-06-26 22:38:06,114 INFO [train.py:996] (2/4) Epoch 10, batch 3600, loss[loss=0.2227, simple_loss=0.2938, pruned_loss=0.07577, over 21602.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.299, pruned_loss=0.07404, over 4275184.04 frames. ], batch size: 263, lr: 3.00e-03, grad_scale: 32.0 2023-06-26 22:38:26,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1668372.0, ans=0.0 2023-06-26 22:38:28,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1668372.0, ans=0.0 2023-06-26 22:39:07,020 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.40 vs. limit=10.0 2023-06-26 22:39:10,298 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-26 22:39:42,552 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.944e+02 5.183e+02 6.801e+02 1.024e+03 2.371e+03, threshold=1.360e+03, percent-clipped=4.0 2023-06-26 22:39:54,941 INFO [train.py:996] (2/4) Epoch 10, batch 3650, loss[loss=0.194, simple_loss=0.3053, pruned_loss=0.0413, over 20848.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2998, pruned_loss=0.07431, over 4272683.00 frames. ], batch size: 607, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:40:00,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1668612.0, ans=0.0 2023-06-26 22:40:47,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1668732.0, ans=0.2 2023-06-26 22:40:52,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1668732.0, ans=0.0 2023-06-26 22:40:57,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1668792.0, ans=0.125 2023-06-26 22:41:03,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1668792.0, ans=0.125 2023-06-26 22:41:22,379 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.22 vs. 
limit=22.5 2023-06-26 22:41:41,218 INFO [train.py:996] (2/4) Epoch 10, batch 3700, loss[loss=0.2509, simple_loss=0.3145, pruned_loss=0.09368, over 21799.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2966, pruned_loss=0.07345, over 4274681.12 frames. ], batch size: 441, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:41:52,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1668912.0, ans=0.125 2023-06-26 22:42:16,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1668972.0, ans=0.0 2023-06-26 22:42:36,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1669032.0, ans=0.1 2023-06-26 22:43:23,450 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.784e+02 6.221e+02 8.574e+02 1.297e+03 2.866e+03, threshold=1.715e+03, percent-clipped=21.0 2023-06-26 22:43:30,705 INFO [train.py:996] (2/4) Epoch 10, batch 3750, loss[loss=0.2091, simple_loss=0.2844, pruned_loss=0.06691, over 21894.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2963, pruned_loss=0.07285, over 4281084.71 frames. ], batch size: 351, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:43:58,882 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.90 vs. limit=10.0 2023-06-26 22:44:04,991 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 22:44:18,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1669332.0, ans=0.125 2023-06-26 22:45:17,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1669512.0, ans=0.0 2023-06-26 22:45:18,860 INFO [train.py:996] (2/4) Epoch 10, batch 3800, loss[loss=0.1963, simple_loss=0.2768, pruned_loss=0.05788, over 21628.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2933, pruned_loss=0.07075, over 4280897.29 frames. ], batch size: 263, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:45:45,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1669572.0, ans=0.125 2023-06-26 22:46:57,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1669752.0, ans=0.07 2023-06-26 22:46:58,174 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.737e+02 5.847e+02 8.030e+02 1.160e+03 2.493e+03, threshold=1.606e+03, percent-clipped=8.0 2023-06-26 22:47:10,256 INFO [train.py:996] (2/4) Epoch 10, batch 3850, loss[loss=0.1883, simple_loss=0.2533, pruned_loss=0.06166, over 21506.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2924, pruned_loss=0.07149, over 4271206.52 frames. ], batch size: 230, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:47:39,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1669872.0, ans=0.125 2023-06-26 22:47:45,753 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0 2023-06-26 22:47:52,073 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.20 vs. 
limit=22.5 2023-06-26 22:48:42,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1670052.0, ans=0.0 2023-06-26 22:48:45,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1670052.0, ans=0.125 2023-06-26 22:48:51,939 INFO [train.py:996] (2/4) Epoch 10, batch 3900, loss[loss=0.1932, simple_loss=0.273, pruned_loss=0.05667, over 21864.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2897, pruned_loss=0.07134, over 4269237.82 frames. ], batch size: 107, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:49:37,508 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.25 vs. limit=15.0 2023-06-26 22:49:37,746 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.22 vs. limit=15.0 2023-06-26 22:49:40,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1670232.0, ans=0.125 2023-06-26 22:49:45,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1670232.0, ans=0.125 2023-06-26 22:49:51,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1670232.0, ans=0.2 2023-06-26 22:50:17,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1670292.0, ans=0.0 2023-06-26 22:50:29,115 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.43 vs. limit=6.0 2023-06-26 22:50:40,118 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.341e+02 6.738e+02 9.125e+02 1.558e+03 3.098e+03, threshold=1.825e+03, percent-clipped=22.0 2023-06-26 22:50:47,231 INFO [train.py:996] (2/4) Epoch 10, batch 3950, loss[loss=0.1871, simple_loss=0.2583, pruned_loss=0.05798, over 21804.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2915, pruned_loss=0.07066, over 4263698.17 frames. ], batch size: 118, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:51:34,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1670532.0, ans=0.125 2023-06-26 22:51:52,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1670532.0, ans=0.125 2023-06-26 22:52:06,072 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.40 vs. limit=22.5 2023-06-26 22:52:35,954 INFO [train.py:996] (2/4) Epoch 10, batch 4000, loss[loss=0.1908, simple_loss=0.2567, pruned_loss=0.06246, over 21418.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2854, pruned_loss=0.06798, over 4264220.37 frames. 
], batch size: 212, lr: 2.99e-03, grad_scale: 32.0 2023-06-26 22:52:36,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1670712.0, ans=0.1 2023-06-26 22:52:59,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.whiten.whitening_limit, batch_count=1670772.0, ans=12.0 2023-06-26 22:53:09,189 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.78 vs. limit=15.0 2023-06-26 22:53:12,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1670772.0, ans=15.0 2023-06-26 22:53:47,365 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.17 vs. limit=22.5 2023-06-26 22:53:50,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1670892.0, ans=0.0 2023-06-26 22:54:19,975 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.363e+02 6.033e+02 8.423e+02 1.568e+03 3.555e+03, threshold=1.685e+03, percent-clipped=19.0 2023-06-26 22:54:31,311 INFO [train.py:996] (2/4) Epoch 10, batch 4050, loss[loss=0.2089, simple_loss=0.2995, pruned_loss=0.05912, over 21393.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2851, pruned_loss=0.0668, over 4265877.76 frames. ], batch size: 211, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:54:37,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1671012.0, ans=0.125 2023-06-26 22:54:54,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1671072.0, ans=0.125 2023-06-26 22:55:17,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1671132.0, ans=0.1 2023-06-26 22:56:20,361 INFO [train.py:996] (2/4) Epoch 10, batch 4100, loss[loss=0.2584, simple_loss=0.3158, pruned_loss=0.1005, over 21687.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2875, pruned_loss=0.06712, over 4275489.48 frames. ], batch size: 507, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:57:11,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1671432.0, ans=0.0 2023-06-26 22:57:15,217 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=12.0 2023-06-26 22:57:23,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1671492.0, ans=0.0 2023-06-26 22:57:54,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1671552.0, ans=0.0 2023-06-26 22:57:57,625 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.779e+02 5.678e+02 9.516e+02 1.395e+03 3.425e+03, threshold=1.903e+03, percent-clipped=17.0 2023-06-26 22:58:02,755 INFO [train.py:996] (2/4) Epoch 10, batch 4150, loss[loss=0.2296, simple_loss=0.3053, pruned_loss=0.07695, over 21553.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.289, pruned_loss=0.06508, over 4267117.17 frames. 
], batch size: 441, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:58:19,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1671672.0, ans=15.0 2023-06-26 22:58:20,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1671672.0, ans=0.2 2023-06-26 22:59:38,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1671852.0, ans=0.125 2023-06-26 22:59:48,072 INFO [train.py:996] (2/4) Epoch 10, batch 4200, loss[loss=0.1891, simple_loss=0.2599, pruned_loss=0.0592, over 21819.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2887, pruned_loss=0.06442, over 4268411.26 frames. ], batch size: 107, lr: 2.99e-03, grad_scale: 8.0 2023-06-26 23:00:28,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1672032.0, ans=0.09899494936611666 2023-06-26 23:00:34,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1672032.0, ans=0.0 2023-06-26 23:01:29,529 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.633e+02 4.957e+02 6.956e+02 1.176e+03 3.842e+03, threshold=1.391e+03, percent-clipped=7.0 2023-06-26 23:01:33,259 INFO [train.py:996] (2/4) Epoch 10, batch 4250, loss[loss=0.2223, simple_loss=0.2884, pruned_loss=0.07808, over 20854.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2945, pruned_loss=0.06671, over 4268037.35 frames. ], batch size: 611, lr: 2.99e-03, grad_scale: 8.0 2023-06-26 23:02:11,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1672272.0, ans=15.0 2023-06-26 23:02:30,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1672332.0, ans=0.05 2023-06-26 23:02:57,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1672392.0, ans=0.125 2023-06-26 23:03:03,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1672452.0, ans=0.0 2023-06-26 23:03:30,201 INFO [train.py:996] (2/4) Epoch 10, batch 4300, loss[loss=0.2024, simple_loss=0.3167, pruned_loss=0.04404, over 20749.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.3007, pruned_loss=0.06811, over 4271469.08 frames. 
], batch size: 608, lr: 2.99e-03, grad_scale: 8.0 2023-06-26 23:03:43,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1672512.0, ans=0.2 2023-06-26 23:04:39,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1672692.0, ans=0.0 2023-06-26 23:05:10,568 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 23:05:15,316 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.979e+02 6.207e+02 8.849e+02 1.440e+03 4.327e+03, threshold=1.770e+03, percent-clipped=25.0 2023-06-26 23:05:17,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1672812.0, ans=10.0 2023-06-26 23:05:17,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1672812.0, ans=0.0 2023-06-26 23:05:18,711 INFO [train.py:996] (2/4) Epoch 10, batch 4350, loss[loss=0.1997, simple_loss=0.2727, pruned_loss=0.06341, over 21603.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2992, pruned_loss=0.06723, over 4267057.51 frames. ], batch size: 298, lr: 2.99e-03, grad_scale: 8.0 2023-06-26 23:07:07,226 INFO [train.py:996] (2/4) Epoch 10, batch 4400, loss[loss=0.1921, simple_loss=0.2631, pruned_loss=0.06059, over 21764.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2955, pruned_loss=0.06685, over 4269949.07 frames. ], batch size: 112, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:07:08,397 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=22.5 2023-06-26 23:07:39,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1673172.0, ans=0.0 2023-06-26 23:07:44,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1673172.0, ans=0.2 2023-06-26 23:07:50,376 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.39 vs. limit=15.0 2023-06-26 23:07:51,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1673172.0, ans=0.2 2023-06-26 23:08:52,417 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.985e+02 5.856e+02 8.779e+02 1.198e+03 2.482e+03, threshold=1.756e+03, percent-clipped=8.0 2023-06-26 23:08:56,191 INFO [train.py:996] (2/4) Epoch 10, batch 4450, loss[loss=0.2311, simple_loss=0.2891, pruned_loss=0.08649, over 21282.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.3035, pruned_loss=0.06841, over 4275772.23 frames. 
], batch size: 471, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:09:25,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1673472.0, ans=0.2 2023-06-26 23:09:33,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1673472.0, ans=0.1 2023-06-26 23:09:44,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1673532.0, ans=0.125 2023-06-26 23:09:48,619 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=12.0 2023-06-26 23:09:51,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1673532.0, ans=0.125 2023-06-26 23:10:35,671 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.73 vs. limit=15.0 2023-06-26 23:10:45,097 INFO [train.py:996] (2/4) Epoch 10, batch 4500, loss[loss=0.2236, simple_loss=0.3408, pruned_loss=0.05323, over 21202.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3058, pruned_loss=0.07031, over 4284530.29 frames. ], batch size: 548, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:11:40,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1673832.0, ans=0.0 2023-06-26 23:11:54,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1673832.0, ans=0.125 2023-06-26 23:12:25,249 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.22 vs. limit=15.0 2023-06-26 23:12:26,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1673952.0, ans=0.1 2023-06-26 23:12:31,379 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.557e+02 6.437e+02 9.027e+02 1.407e+03 3.220e+03, threshold=1.805e+03, percent-clipped=13.0 2023-06-26 23:12:39,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1674012.0, ans=0.125 2023-06-26 23:12:46,677 INFO [train.py:996] (2/4) Epoch 10, batch 4550, loss[loss=0.2457, simple_loss=0.3267, pruned_loss=0.08239, over 21481.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3073, pruned_loss=0.07045, over 4278675.29 frames. ], batch size: 211, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:12:50,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1674012.0, ans=0.0 2023-06-26 23:13:39,079 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.15 vs. 
limit=15.0 2023-06-26 23:14:05,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1674252.0, ans=10.0 2023-06-26 23:14:09,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1674252.0, ans=0.125 2023-06-26 23:14:33,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1674312.0, ans=0.0 2023-06-26 23:14:34,569 INFO [train.py:996] (2/4) Epoch 10, batch 4600, loss[loss=0.1913, simple_loss=0.2708, pruned_loss=0.05591, over 21328.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3101, pruned_loss=0.07162, over 4280629.09 frames. ], batch size: 176, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:14:52,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1674372.0, ans=0.125 2023-06-26 23:15:37,826 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 23:16:17,989 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.163e+02 6.181e+02 9.452e+02 1.480e+03 3.323e+03, threshold=1.890e+03, percent-clipped=16.0 2023-06-26 23:16:20,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1674612.0, ans=0.0 2023-06-26 23:16:21,532 INFO [train.py:996] (2/4) Epoch 10, batch 4650, loss[loss=0.1595, simple_loss=0.2342, pruned_loss=0.04239, over 21495.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3045, pruned_loss=0.0707, over 4276632.20 frames. ], batch size: 212, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:17:02,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1674672.0, ans=0.2 2023-06-26 23:17:37,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1674792.0, ans=0.0 2023-06-26 23:18:08,105 INFO [train.py:996] (2/4) Epoch 10, batch 4700, loss[loss=0.1835, simple_loss=0.2512, pruned_loss=0.05795, over 21236.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2945, pruned_loss=0.0681, over 4279444.86 frames. ], batch size: 159, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:18:20,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1674912.0, ans=0.125 2023-06-26 23:18:38,596 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.19 vs. limit=15.0 2023-06-26 23:19:29,261 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=15.0 2023-06-26 23:19:50,880 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.765e+02 4.747e+02 5.523e+02 7.889e+02 1.677e+03, threshold=1.105e+03, percent-clipped=0.0 2023-06-26 23:19:54,060 INFO [train.py:996] (2/4) Epoch 10, batch 4750, loss[loss=0.2089, simple_loss=0.2739, pruned_loss=0.0719, over 21603.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2886, pruned_loss=0.06814, over 4278847.05 frames. 
], batch size: 263, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:19:59,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1675212.0, ans=0.125 2023-06-26 23:20:19,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1675272.0, ans=0.5 2023-06-26 23:20:28,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1675272.0, ans=0.125 2023-06-26 23:21:34,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1675452.0, ans=0.125 2023-06-26 23:21:41,804 INFO [train.py:996] (2/4) Epoch 10, batch 4800, loss[loss=0.198, simple_loss=0.2694, pruned_loss=0.06333, over 21135.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.289, pruned_loss=0.06829, over 4285790.69 frames. ], batch size: 143, lr: 2.99e-03, grad_scale: 32.0 2023-06-26 23:22:23,962 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.25 vs. limit=15.0 2023-06-26 23:22:34,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1675632.0, ans=0.0 2023-06-26 23:22:36,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1675632.0, ans=0.0 2023-06-26 23:23:25,458 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.997e+02 5.704e+02 8.592e+02 1.252e+03 2.093e+03, threshold=1.718e+03, percent-clipped=31.0 2023-06-26 23:23:27,162 INFO [train.py:996] (2/4) Epoch 10, batch 4850, loss[loss=0.2169, simple_loss=0.2823, pruned_loss=0.07574, over 21845.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2889, pruned_loss=0.068, over 4284146.92 frames. ], batch size: 118, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:23:44,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1675812.0, ans=0.1 2023-06-26 23:24:00,856 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.04 vs. limit=22.5 2023-06-26 23:24:14,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1675932.0, ans=0.2 2023-06-26 23:24:28,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1675992.0, ans=0.0 2023-06-26 23:24:34,469 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.95 vs. limit=15.0 2023-06-26 23:24:34,501 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.33 vs. limit=15.0 2023-06-26 23:25:15,484 INFO [train.py:996] (2/4) Epoch 10, batch 4900, loss[loss=0.2583, simple_loss=0.3903, pruned_loss=0.06315, over 20834.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2909, pruned_loss=0.06876, over 4283803.56 frames. 
], batch size: 607, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:25:59,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1676232.0, ans=0.1 2023-06-26 23:26:33,895 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-06-26 23:26:34,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1676292.0, ans=0.2 2023-06-26 23:26:45,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1676352.0, ans=0.0 2023-06-26 23:26:54,770 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=15.0 2023-06-26 23:27:07,354 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.910e+02 6.746e+02 9.232e+02 1.272e+03 2.922e+03, threshold=1.846e+03, percent-clipped=7.0 2023-06-26 23:27:08,951 INFO [train.py:996] (2/4) Epoch 10, batch 4950, loss[loss=0.1863, simple_loss=0.2885, pruned_loss=0.04201, over 21770.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2947, pruned_loss=0.06736, over 4283382.00 frames. ], batch size: 282, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:27:20,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1676412.0, ans=0.1 2023-06-26 23:27:21,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1676412.0, ans=0.0 2023-06-26 23:27:37,676 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-26 23:27:49,359 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.57 vs. limit=15.0 2023-06-26 23:27:50,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1676532.0, ans=0.125 2023-06-26 23:28:04,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1676592.0, ans=0.125 2023-06-26 23:28:25,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1676592.0, ans=0.2 2023-06-26 23:28:39,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1676652.0, ans=0.125 2023-06-26 23:28:50,823 INFO [train.py:996] (2/4) Epoch 10, batch 5000, loss[loss=0.1992, simple_loss=0.2845, pruned_loss=0.05694, over 21611.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.294, pruned_loss=0.06532, over 4280509.83 frames. 
], batch size: 230, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:29:52,845 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 23:30:21,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1676952.0, ans=0.125 2023-06-26 23:30:35,687 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.540e+02 5.923e+02 8.910e+02 1.386e+03 2.915e+03, threshold=1.782e+03, percent-clipped=9.0 2023-06-26 23:30:36,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1677012.0, ans=0.125 2023-06-26 23:30:37,444 INFO [train.py:996] (2/4) Epoch 10, batch 5050, loss[loss=0.2026, simple_loss=0.2784, pruned_loss=0.06343, over 21822.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2939, pruned_loss=0.06651, over 4286877.32 frames. ], batch size: 282, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:30:58,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1677012.0, ans=0.125 2023-06-26 23:31:38,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1677192.0, ans=0.1 2023-06-26 23:32:14,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1677252.0, ans=0.2 2023-06-26 23:32:18,461 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.76 vs. limit=15.0 2023-06-26 23:32:22,437 INFO [train.py:996] (2/4) Epoch 10, batch 5100, loss[loss=0.1629, simple_loss=0.2492, pruned_loss=0.03829, over 21797.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2911, pruned_loss=0.06674, over 4295231.82 frames. ], batch size: 247, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:32:22,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1677312.0, ans=0.2 2023-06-26 23:33:47,919 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.23 vs. limit=12.0 2023-06-26 23:34:07,924 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.053e+02 6.342e+02 8.169e+02 1.053e+03 2.713e+03, threshold=1.634e+03, percent-clipped=6.0 2023-06-26 23:34:09,495 INFO [train.py:996] (2/4) Epoch 10, batch 5150, loss[loss=0.1918, simple_loss=0.2694, pruned_loss=0.05709, over 21628.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2895, pruned_loss=0.06683, over 4298863.71 frames. ], batch size: 263, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:34:56,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1677732.0, ans=0.125 2023-06-26 23:35:17,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1677792.0, ans=0.125 2023-06-26 23:35:47,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1677852.0, ans=0.0 2023-06-26 23:36:03,524 INFO [train.py:996] (2/4) Epoch 10, batch 5200, loss[loss=0.2142, simple_loss=0.3143, pruned_loss=0.05702, over 21853.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2919, pruned_loss=0.06783, over 4292727.17 frames. 
], batch size: 316, lr: 2.99e-03, grad_scale: 32.0 2023-06-26 23:36:06,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1677912.0, ans=0.1 2023-06-26 23:36:14,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=1677912.0, ans=0.1 2023-06-26 23:36:37,855 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=22.5 2023-06-26 23:36:47,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1678032.0, ans=0.125 2023-06-26 23:37:47,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1678152.0, ans=0.125 2023-06-26 23:37:50,417 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.940e+02 5.817e+02 8.011e+02 1.324e+03 3.418e+03, threshold=1.602e+03, percent-clipped=14.0 2023-06-26 23:37:50,448 INFO [train.py:996] (2/4) Epoch 10, batch 5250, loss[loss=0.1956, simple_loss=0.2769, pruned_loss=0.05716, over 21777.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.297, pruned_loss=0.06728, over 4288403.66 frames. ], batch size: 112, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:38:02,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1678212.0, ans=0.05 2023-06-26 23:38:15,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1678272.0, ans=0.1 2023-06-26 23:38:32,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1678332.0, ans=0.125 2023-06-26 23:39:35,327 INFO [train.py:996] (2/4) Epoch 10, batch 5300, loss[loss=0.2075, simple_loss=0.2799, pruned_loss=0.06757, over 21903.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2956, pruned_loss=0.06705, over 4288427.01 frames. ], batch size: 371, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:39:40,031 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.31 vs. limit=15.0 2023-06-26 23:40:43,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1678692.0, ans=0.1 2023-06-26 23:40:49,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1678692.0, ans=0.0 2023-06-26 23:40:56,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1678692.0, ans=0.0 2023-06-26 23:41:03,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1678752.0, ans=10.0 2023-06-26 23:41:21,209 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.821e+02 5.421e+02 7.005e+02 9.056e+02 1.380e+03, threshold=1.401e+03, percent-clipped=0.0 2023-06-26 23:41:21,249 INFO [train.py:996] (2/4) Epoch 10, batch 5350, loss[loss=0.2117, simple_loss=0.2822, pruned_loss=0.07067, over 21722.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2941, pruned_loss=0.06794, over 4289366.27 frames. 
], batch size: 112, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:41:34,208 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.05 vs. limit=15.0 2023-06-26 23:41:45,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1678872.0, ans=0.125 2023-06-26 23:42:05,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1678932.0, ans=0.2 2023-06-26 23:42:44,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1679052.0, ans=0.125 2023-06-26 23:42:58,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1679052.0, ans=0.125 2023-06-26 23:42:59,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1679052.0, ans=0.125 2023-06-26 23:43:03,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1679052.0, ans=0.125 2023-06-26 23:43:05,915 INFO [train.py:996] (2/4) Epoch 10, batch 5400, loss[loss=0.208, simple_loss=0.2826, pruned_loss=0.06676, over 21754.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2914, pruned_loss=0.06853, over 4299494.53 frames. ], batch size: 441, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:43:15,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1679112.0, ans=0.125 2023-06-26 23:43:45,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1679232.0, ans=0.0 2023-06-26 23:44:11,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1679232.0, ans=0.0 2023-06-26 23:44:52,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1679412.0, ans=0.0 2023-06-26 23:44:53,965 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.666e+02 6.862e+02 1.175e+03 1.926e+03 4.033e+03, threshold=2.351e+03, percent-clipped=41.0 2023-06-26 23:44:54,008 INFO [train.py:996] (2/4) Epoch 10, batch 5450, loss[loss=0.2692, simple_loss=0.351, pruned_loss=0.09367, over 21527.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2928, pruned_loss=0.06735, over 4294844.57 frames. ], batch size: 471, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:45:22,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1679472.0, ans=0.125 2023-06-26 23:46:50,762 INFO [train.py:996] (2/4) Epoch 10, batch 5500, loss[loss=0.25, simple_loss=0.3457, pruned_loss=0.07712, over 21481.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2985, pruned_loss=0.06551, over 4295828.09 frames. ], batch size: 471, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:46:55,741 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.01 vs. 
limit=22.5 2023-06-26 23:47:08,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1679712.0, ans=0.1 2023-06-26 23:47:37,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1679832.0, ans=0.1 2023-06-26 23:48:04,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1679892.0, ans=0.025 2023-06-26 23:48:24,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1679952.0, ans=0.125 2023-06-26 23:48:40,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1679952.0, ans=0.1 2023-06-26 23:48:48,466 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.727e+02 5.357e+02 7.450e+02 1.317e+03 3.051e+03, threshold=1.490e+03, percent-clipped=6.0 2023-06-26 23:48:48,499 INFO [train.py:996] (2/4) Epoch 10, batch 5550, loss[loss=0.2029, simple_loss=0.298, pruned_loss=0.05392, over 21710.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2993, pruned_loss=0.06296, over 4288567.35 frames. ], batch size: 298, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:49:02,168 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.88 vs. limit=10.0 2023-06-26 23:49:11,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1680072.0, ans=0.0 2023-06-26 23:49:37,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1680132.0, ans=0.0 2023-06-26 23:50:29,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1680252.0, ans=0.125 2023-06-26 23:50:38,657 INFO [train.py:996] (2/4) Epoch 10, batch 5600, loss[loss=0.3215, simple_loss=0.4079, pruned_loss=0.1176, over 21410.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2974, pruned_loss=0.06053, over 4282119.63 frames. 
], batch size: 507, lr: 2.99e-03, grad_scale: 32.0 2023-06-26 23:50:39,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1680312.0, ans=0.0 2023-06-26 23:50:48,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1680312.0, ans=0.2 2023-06-26 23:50:53,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1680312.0, ans=0.1 2023-06-26 23:51:06,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1680372.0, ans=0.1 2023-06-26 23:51:24,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1680432.0, ans=0.125 2023-06-26 23:51:34,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1680432.0, ans=0.0 2023-06-26 23:52:17,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1680552.0, ans=0.09899494936611666 2023-06-26 23:52:25,087 INFO [train.py:996] (2/4) Epoch 10, batch 5650, loss[loss=0.256, simple_loss=0.3271, pruned_loss=0.09244, over 21757.00 frames. ], tot_loss[loss=0.214, simple_loss=0.3015, pruned_loss=0.06326, over 4277775.42 frames. ], batch size: 441, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:52:27,143 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.741e+02 5.468e+02 7.224e+02 1.167e+03 2.877e+03, threshold=1.445e+03, percent-clipped=12.0 2023-06-26 23:52:45,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1680612.0, ans=0.05 2023-06-26 23:53:00,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1680672.0, ans=0.125 2023-06-26 23:53:28,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1680792.0, ans=0.125 2023-06-26 23:54:07,677 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.04 vs. limit=15.0 2023-06-26 23:54:13,526 INFO [train.py:996] (2/4) Epoch 10, batch 5700, loss[loss=0.2006, simple_loss=0.266, pruned_loss=0.06756, over 21252.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.3, pruned_loss=0.06484, over 4276137.61 frames. ], batch size: 608, lr: 2.98e-03, grad_scale: 16.0 2023-06-26 23:56:09,532 INFO [train.py:996] (2/4) Epoch 10, batch 5750, loss[loss=0.1767, simple_loss=0.2666, pruned_loss=0.0434, over 21676.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2972, pruned_loss=0.06261, over 4271961.73 frames. 
], batch size: 263, lr: 2.98e-03, grad_scale: 16.0 2023-06-26 23:56:11,424 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.677e+02 6.670e+02 9.043e+02 1.357e+03 3.417e+03, threshold=1.809e+03, percent-clipped=19.0 2023-06-26 23:56:35,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1681272.0, ans=0.2 2023-06-26 23:56:40,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1681272.0, ans=0.1 2023-06-26 23:56:44,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1681272.0, ans=0.125 2023-06-26 23:57:00,639 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.88 vs. limit=10.0 2023-06-26 23:57:34,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1681392.0, ans=0.0 2023-06-26 23:57:38,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1681392.0, ans=0.125 2023-06-26 23:57:55,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1681452.0, ans=0.0 2023-06-26 23:57:58,038 INFO [train.py:996] (2/4) Epoch 10, batch 5800, loss[loss=0.2167, simple_loss=0.3157, pruned_loss=0.05883, over 21829.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2947, pruned_loss=0.06028, over 4277713.87 frames. ], batch size: 316, lr: 2.98e-03, grad_scale: 16.0 2023-06-26 23:58:26,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1681572.0, ans=0.0 2023-06-26 23:58:57,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1681632.0, ans=0.125 2023-06-26 23:58:58,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1681632.0, ans=0.0 2023-06-26 23:59:03,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1681632.0, ans=0.125 2023-06-26 23:59:15,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1681692.0, ans=0.125 2023-06-26 23:59:16,223 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.00 vs. limit=15.0 2023-06-26 23:59:46,313 INFO [train.py:996] (2/4) Epoch 10, batch 5850, loss[loss=0.1658, simple_loss=0.2663, pruned_loss=0.03267, over 21729.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2934, pruned_loss=0.05736, over 4275885.13 frames. 
], batch size: 247, lr: 2.98e-03, grad_scale: 16.0 2023-06-26 23:59:53,508 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.721e+02 4.995e+02 7.881e+02 1.168e+03 2.434e+03, threshold=1.576e+03, percent-clipped=1.0 2023-06-27 00:00:02,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1681812.0, ans=0.1 2023-06-27 00:00:08,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1681872.0, ans=0.125 2023-06-27 00:01:08,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1681992.0, ans=0.1 2023-06-27 00:01:11,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1681992.0, ans=0.0 2023-06-27 00:01:37,803 INFO [train.py:996] (2/4) Epoch 10, batch 5900, loss[loss=0.1887, simple_loss=0.2569, pruned_loss=0.06022, over 21243.00 frames. ], tot_loss[loss=0.1967, simple_loss=0.2865, pruned_loss=0.05347, over 4273279.23 frames. ], batch size: 608, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:01:40,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1682112.0, ans=0.125 2023-06-27 00:02:29,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1682232.0, ans=0.125 2023-06-27 00:02:59,432 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-27 00:03:00,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1682352.0, ans=0.0 2023-06-27 00:03:17,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1682352.0, ans=0.2 2023-06-27 00:03:24,124 INFO [train.py:996] (2/4) Epoch 10, batch 5950, loss[loss=0.1905, simple_loss=0.2543, pruned_loss=0.06332, over 21581.00 frames. ], tot_loss[loss=0.1982, simple_loss=0.2841, pruned_loss=0.05615, over 4276178.96 frames. ], batch size: 263, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:03:25,856 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.299e+02 4.862e+02 7.145e+02 9.461e+02 2.592e+03, threshold=1.429e+03, percent-clipped=2.0 2023-06-27 00:04:07,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1682532.0, ans=0.125 2023-06-27 00:04:26,087 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 00:04:36,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1682592.0, ans=0.1 2023-06-27 00:05:08,657 INFO [train.py:996] (2/4) Epoch 10, batch 6000, loss[loss=0.1782, simple_loss=0.2407, pruned_loss=0.05785, over 21470.00 frames. ], tot_loss[loss=0.1991, simple_loss=0.2803, pruned_loss=0.05891, over 4272529.10 frames. ], batch size: 212, lr: 2.98e-03, grad_scale: 32.0 2023-06-27 00:05:08,657 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-27 00:05:29,833 INFO [train.py:1028] (2/4) Epoch 10, validation: loss=0.2604, simple_loss=0.3533, pruned_loss=0.08374, over 1796401.00 frames. 
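Note on the recurring optim.py entries above (e.g. "Clipping_scale=2.0, grad-norm quartiles 3.940e+02 5.817e+02 8.011e+02 1.324e+03 3.418e+03, threshold=1.602e+03, percent-clipped=14.0"): the five numbers appear to be the min / 25% / median / 75% / max of the per-batch gradient norms seen since the previous report, the logged threshold consistently equals Clipping_scale times the median (2 x 8.011e+02 = 1.602e+03), and percent-clipped is presumably the share of those batches whose norm exceeded the threshold. The minimal Python sketch below reproduces that summary from a window of gradient norms; the window handling and the exact clipping rule are assumptions for illustration only, not code taken from the actual icefall optim.py.

import torch

def summarize_grad_norms(norm_history, clipping_scale=2.0):
    # Five-point summary (min, 25%, median, 75%, max) of recent gradient norms,
    # mirroring the "grad-norm quartiles ... threshold ... percent-clipped"
    # log lines above. The threshold rule (clipping_scale * median) matches the
    # logged values but is an assumption here, not the real implementation.
    norms = torch.tensor(norm_history, dtype=torch.float32)
    quartiles = torch.quantile(norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * quartiles[2]
    percent_clipped = 100.0 * (norms > threshold).float().mean()
    return quartiles.tolist(), threshold.item(), percent_clipped.item()

# Hypothetical usage with made-up norms; in training these would be the
# per-batch gradient norms collected since the previous report.
q, thr, pct = summarize_grad_norms([350.0, 520.0, 760.0, 1200.0, 3400.0])
print("grad-norm quartiles", ["%.3e" % v for v in q],
      "threshold=%.3e" % thr, "percent-clipped=%.1f" % pct)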
2023-06-27 00:05:29,834 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-27 00:06:12,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1682832.0, ans=0.125 2023-06-27 00:06:20,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1682832.0, ans=0.0 2023-06-27 00:06:48,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1682952.0, ans=0.125 2023-06-27 00:07:18,981 INFO [train.py:996] (2/4) Epoch 10, batch 6050, loss[loss=0.2097, simple_loss=0.2773, pruned_loss=0.07104, over 21834.00 frames. ], tot_loss[loss=0.197, simple_loss=0.2753, pruned_loss=0.05936, over 4278990.73 frames. ], batch size: 107, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:07:24,252 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.965e+02 5.435e+02 7.983e+02 1.281e+03 2.662e+03, threshold=1.597e+03, percent-clipped=18.0 2023-06-27 00:07:34,364 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.90 vs. limit=22.5 2023-06-27 00:07:49,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1683072.0, ans=0.1 2023-06-27 00:07:51,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1683072.0, ans=0.0 2023-06-27 00:08:33,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1683192.0, ans=0.1 2023-06-27 00:08:36,394 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.72 vs. limit=15.0 2023-06-27 00:09:06,573 INFO [train.py:996] (2/4) Epoch 10, batch 6100, loss[loss=0.2303, simple_loss=0.2987, pruned_loss=0.08088, over 21520.00 frames. ], tot_loss[loss=0.1966, simple_loss=0.2754, pruned_loss=0.05888, over 4281715.41 frames. ], batch size: 131, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:10:14,313 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 00:10:17,433 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 00:10:23,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1683492.0, ans=0.0 2023-06-27 00:10:43,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1683552.0, ans=0.125 2023-06-27 00:10:50,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1683552.0, ans=0.0 2023-06-27 00:10:52,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1683612.0, ans=0.0 2023-06-27 00:10:53,288 INFO [train.py:996] (2/4) Epoch 10, batch 6150, loss[loss=0.19, simple_loss=0.2685, pruned_loss=0.05573, over 21500.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2789, pruned_loss=0.06058, over 4288901.36 frames. 
], batch size: 195, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:10:58,722 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.616e+02 5.589e+02 9.647e+02 1.302e+03 3.090e+03, threshold=1.929e+03, percent-clipped=16.0 2023-06-27 00:10:59,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1683612.0, ans=0.0 2023-06-27 00:11:46,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1683732.0, ans=0.0 2023-06-27 00:11:48,436 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=22.5 2023-06-27 00:12:10,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1683792.0, ans=0.0 2023-06-27 00:12:42,239 INFO [train.py:996] (2/4) Epoch 10, batch 6200, loss[loss=0.2213, simple_loss=0.2989, pruned_loss=0.07189, over 21369.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2834, pruned_loss=0.06183, over 4287118.53 frames. ], batch size: 211, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:12:51,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1683912.0, ans=0.125 2023-06-27 00:13:35,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1684032.0, ans=0.07 2023-06-27 00:14:03,044 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.48 vs. limit=22.5 2023-06-27 00:14:18,900 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.90 vs. limit=22.5 2023-06-27 00:14:31,223 INFO [train.py:996] (2/4) Epoch 10, batch 6250, loss[loss=0.2075, simple_loss=0.2927, pruned_loss=0.06118, over 21465.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2872, pruned_loss=0.06208, over 4284683.45 frames. ], batch size: 211, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:14:36,262 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.907e+02 5.995e+02 9.540e+02 1.636e+03 4.135e+03, threshold=1.908e+03, percent-clipped=20.0 2023-06-27 00:14:54,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1684272.0, ans=0.125 2023-06-27 00:15:02,379 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 00:15:43,856 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. limit=6.0 2023-06-27 00:16:09,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1684452.0, ans=0.1 2023-06-27 00:16:15,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1684512.0, ans=15.0 2023-06-27 00:16:16,366 INFO [train.py:996] (2/4) Epoch 10, batch 6300, loss[loss=0.2044, simple_loss=0.3059, pruned_loss=0.05144, over 20876.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2906, pruned_loss=0.06105, over 4294361.00 frames. 
], batch size: 608, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:16:38,018 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.96 vs. limit=22.5 2023-06-27 00:17:09,546 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.74 vs. limit=22.5 2023-06-27 00:17:10,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1684632.0, ans=0.09899494936611666 2023-06-27 00:18:08,450 INFO [train.py:996] (2/4) Epoch 10, batch 6350, loss[loss=0.2333, simple_loss=0.3074, pruned_loss=0.07966, over 21819.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2936, pruned_loss=0.06382, over 4285257.47 frames. ], batch size: 351, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:18:13,743 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.722e+02 5.276e+02 6.494e+02 9.126e+02 1.517e+03, threshold=1.299e+03, percent-clipped=0.0 2023-06-27 00:19:31,486 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=12.0 2023-06-27 00:19:32,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1684992.0, ans=0.04949747468305833 2023-06-27 00:19:34,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1685052.0, ans=0.04949747468305833 2023-06-27 00:19:53,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1685052.0, ans=0.125 2023-06-27 00:19:57,959 INFO [train.py:996] (2/4) Epoch 10, batch 6400, loss[loss=0.2434, simple_loss=0.3189, pruned_loss=0.08397, over 21596.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2991, pruned_loss=0.06791, over 4282919.31 frames. ], batch size: 415, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:21:03,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1685232.0, ans=10.0 2023-06-27 00:21:21,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1685292.0, ans=0.0 2023-06-27 00:21:51,044 INFO [train.py:996] (2/4) Epoch 10, batch 6450, loss[loss=0.1972, simple_loss=0.291, pruned_loss=0.05175, over 21409.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.3004, pruned_loss=0.06688, over 4283373.79 frames. ], batch size: 211, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:21:55,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1685412.0, ans=0.2 2023-06-27 00:21:55,955 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.237e+02 6.943e+02 1.024e+03 1.521e+03 2.741e+03, threshold=2.048e+03, percent-clipped=32.0 2023-06-27 00:23:37,626 INFO [train.py:996] (2/4) Epoch 10, batch 6500, loss[loss=0.1787, simple_loss=0.2648, pruned_loss=0.04629, over 21521.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2929, pruned_loss=0.06552, over 4270334.68 frames. ], batch size: 389, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:23:57,530 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.23 vs. 
limit=15.0 2023-06-27 00:24:03,878 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.27 vs. limit=15.0 2023-06-27 00:24:10,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1685772.0, ans=0.125 2023-06-27 00:24:50,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1685892.0, ans=0.125 2023-06-27 00:24:56,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1685892.0, ans=0.125 2023-06-27 00:25:01,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1685952.0, ans=0.5 2023-06-27 00:25:23,188 INFO [train.py:996] (2/4) Epoch 10, batch 6550, loss[loss=0.2285, simple_loss=0.3333, pruned_loss=0.06181, over 21504.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2912, pruned_loss=0.06491, over 4277557.79 frames. ], batch size: 471, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:25:28,449 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.027e+02 5.505e+02 8.547e+02 1.330e+03 2.902e+03, threshold=1.709e+03, percent-clipped=6.0 2023-06-27 00:25:41,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1686012.0, ans=0.125 2023-06-27 00:25:44,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1686072.0, ans=0.0 2023-06-27 00:27:04,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1686252.0, ans=0.1 2023-06-27 00:27:10,180 INFO [train.py:996] (2/4) Epoch 10, batch 6600, loss[loss=0.1719, simple_loss=0.244, pruned_loss=0.04989, over 21814.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2877, pruned_loss=0.0654, over 4257314.27 frames. ], batch size: 98, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:27:35,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1686372.0, ans=0.0 2023-06-27 00:27:54,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1686432.0, ans=0.04949747468305833 2023-06-27 00:28:00,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1686432.0, ans=0.0 2023-06-27 00:28:06,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1686432.0, ans=0.125 2023-06-27 00:28:24,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1686492.0, ans=0.125 2023-06-27 00:28:28,317 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.62 vs. 
limit=22.5 2023-06-27 00:28:31,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1686492.0, ans=0.125 2023-06-27 00:28:34,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1686552.0, ans=0.0 2023-06-27 00:28:57,106 INFO [train.py:996] (2/4) Epoch 10, batch 6650, loss[loss=0.1955, simple_loss=0.2672, pruned_loss=0.06187, over 21588.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2813, pruned_loss=0.06239, over 4258539.56 frames. ], batch size: 391, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:29:09,397 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.441e+02 5.556e+02 7.751e+02 1.155e+03 2.381e+03, threshold=1.550e+03, percent-clipped=8.0 2023-06-27 00:29:26,084 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=22.5 2023-06-27 00:29:40,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1686732.0, ans=0.1 2023-06-27 00:29:59,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1686732.0, ans=0.125 2023-06-27 00:30:04,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1686792.0, ans=0.0 2023-06-27 00:30:48,151 INFO [train.py:996] (2/4) Epoch 10, batch 6700, loss[loss=0.1797, simple_loss=0.2624, pruned_loss=0.04855, over 15841.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2772, pruned_loss=0.06237, over 4259670.37 frames. ], batch size: 60, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:30:55,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1686912.0, ans=0.0 2023-06-27 00:31:23,288 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-27 00:31:39,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1687032.0, ans=0.1 2023-06-27 00:32:29,106 INFO [train.py:996] (2/4) Epoch 10, batch 6750, loss[loss=0.2119, simple_loss=0.2841, pruned_loss=0.06982, over 21884.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2752, pruned_loss=0.06292, over 4266386.36 frames. ], batch size: 118, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:32:41,012 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.839e+02 5.646e+02 8.043e+02 1.106e+03 2.898e+03, threshold=1.609e+03, percent-clipped=7.0 2023-06-27 00:33:00,751 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.07 vs. limit=5.0 2023-06-27 00:33:42,819 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.33 vs. limit=12.0 2023-06-27 00:34:13,865 INFO [train.py:996] (2/4) Epoch 10, batch 6800, loss[loss=0.198, simple_loss=0.257, pruned_loss=0.06953, over 21249.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2779, pruned_loss=0.06494, over 4268575.33 frames. ], batch size: 176, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:34:26,979 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.49 vs. 
limit=15.0 2023-06-27 00:35:49,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1687752.0, ans=0.1 2023-06-27 00:36:00,676 INFO [train.py:996] (2/4) Epoch 10, batch 6850, loss[loss=0.2493, simple_loss=0.2847, pruned_loss=0.107, over 21511.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2762, pruned_loss=0.06609, over 4277706.59 frames. ], batch size: 511, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:36:07,586 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.100e+02 5.578e+02 7.964e+02 1.217e+03 2.059e+03, threshold=1.593e+03, percent-clipped=9.0 2023-06-27 00:36:17,259 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.97 vs. limit=15.0 2023-06-27 00:36:23,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1687872.0, ans=0.125 2023-06-27 00:36:28,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1687872.0, ans=0.125 2023-06-27 00:37:05,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1687932.0, ans=0.0 2023-06-27 00:37:05,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1687932.0, ans=0.125 2023-06-27 00:37:30,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1688052.0, ans=0.1 2023-06-27 00:37:39,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1688052.0, ans=0.0 2023-06-27 00:37:47,345 INFO [train.py:996] (2/4) Epoch 10, batch 6900, loss[loss=0.1947, simple_loss=0.2674, pruned_loss=0.061, over 21906.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2769, pruned_loss=0.06571, over 4279961.62 frames. ], batch size: 107, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:38:10,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1688172.0, ans=0.0 2023-06-27 00:38:46,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1688232.0, ans=0.0 2023-06-27 00:39:14,190 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.86 vs. limit=15.0 2023-06-27 00:39:38,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1688352.0, ans=0.0 2023-06-27 00:39:41,214 INFO [train.py:996] (2/4) Epoch 10, batch 6950, loss[loss=0.2188, simple_loss=0.2961, pruned_loss=0.07075, over 21351.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2792, pruned_loss=0.06332, over 4283911.17 frames. 
], batch size: 159, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:39:41,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1688412.0, ans=0.125 2023-06-27 00:39:47,983 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.023e+02 6.673e+02 8.913e+02 1.216e+03 2.486e+03, threshold=1.783e+03, percent-clipped=9.0 2023-06-27 00:39:52,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1688412.0, ans=0.2 2023-06-27 00:40:02,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1688472.0, ans=0.125 2023-06-27 00:40:09,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1688472.0, ans=0.125 2023-06-27 00:40:18,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1688472.0, ans=0.125 2023-06-27 00:40:23,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1688472.0, ans=0.125 2023-06-27 00:40:26,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1688532.0, ans=0.125 2023-06-27 00:40:33,991 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-27 00:40:57,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1688592.0, ans=0.0 2023-06-27 00:41:01,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1688592.0, ans=0.0 2023-06-27 00:41:28,521 INFO [train.py:996] (2/4) Epoch 10, batch 7000, loss[loss=0.2434, simple_loss=0.3686, pruned_loss=0.05916, over 19838.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.282, pruned_loss=0.06491, over 4272029.72 frames. ], batch size: 702, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:41:40,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1688712.0, ans=10.0 2023-06-27 00:42:14,174 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2023-06-27 00:42:40,293 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.70 vs. limit=22.5 2023-06-27 00:42:41,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1688892.0, ans=0.2 2023-06-27 00:43:15,463 INFO [train.py:996] (2/4) Epoch 10, batch 7050, loss[loss=0.1796, simple_loss=0.2572, pruned_loss=0.05096, over 21354.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2802, pruned_loss=0.06367, over 4267165.02 frames. ], batch size: 211, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:43:27,750 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.782e+02 6.761e+02 1.057e+03 1.502e+03 3.144e+03, threshold=2.115e+03, percent-clipped=16.0 2023-06-27 00:44:25,377 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.03 vs. 
limit=15.0 2023-06-27 00:44:42,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1689192.0, ans=0.125 2023-06-27 00:45:09,750 INFO [train.py:996] (2/4) Epoch 10, batch 7100, loss[loss=0.2449, simple_loss=0.3226, pruned_loss=0.08358, over 21560.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2862, pruned_loss=0.06588, over 4268591.85 frames. ], batch size: 414, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:46:40,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1689552.0, ans=0.1 2023-06-27 00:47:02,312 INFO [train.py:996] (2/4) Epoch 10, batch 7150, loss[loss=0.2251, simple_loss=0.2982, pruned_loss=0.07604, over 21592.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2818, pruned_loss=0.06348, over 4259554.45 frames. ], batch size: 230, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:47:09,232 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.861e+02 6.064e+02 8.725e+02 1.357e+03 2.823e+03, threshold=1.745e+03, percent-clipped=6.0 2023-06-27 00:47:15,451 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.79 vs. limit=15.0 2023-06-27 00:47:33,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1689672.0, ans=0.125 2023-06-27 00:47:48,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1689732.0, ans=0.1 2023-06-27 00:48:33,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=1689852.0, ans=22.5 2023-06-27 00:48:48,716 INFO [train.py:996] (2/4) Epoch 10, batch 7200, loss[loss=0.2068, simple_loss=0.2748, pruned_loss=0.06935, over 21808.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2832, pruned_loss=0.06507, over 4264890.03 frames. ], batch size: 352, lr: 2.98e-03, grad_scale: 32.0 2023-06-27 00:48:51,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1689912.0, ans=0.125 2023-06-27 00:48:51,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1689912.0, ans=0.125 2023-06-27 00:49:21,284 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-27 00:49:22,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1689972.0, ans=0.2 2023-06-27 00:49:28,228 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.28 vs. limit=8.0 2023-06-27 00:50:24,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1690152.0, ans=0.0 2023-06-27 00:50:29,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1690152.0, ans=0.0 2023-06-27 00:50:34,341 INFO [train.py:996] (2/4) Epoch 10, batch 7250, loss[loss=0.1972, simple_loss=0.2642, pruned_loss=0.06506, over 22029.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2792, pruned_loss=0.06521, over 4264744.55 frames. 
], batch size: 103, lr: 2.98e-03, grad_scale: 32.0 2023-06-27 00:50:40,771 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.886e+02 6.230e+02 8.378e+02 1.198e+03 2.214e+03, threshold=1.676e+03, percent-clipped=4.0 2023-06-27 00:51:05,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1690272.0, ans=0.1 2023-06-27 00:51:21,212 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.85 vs. limit=15.0 2023-06-27 00:52:18,850 INFO [train.py:996] (2/4) Epoch 10, batch 7300, loss[loss=0.1736, simple_loss=0.2419, pruned_loss=0.05268, over 21720.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2726, pruned_loss=0.06398, over 4267987.62 frames. ], batch size: 300, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:52:19,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1690512.0, ans=0.1 2023-06-27 00:52:22,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1690512.0, ans=0.125 2023-06-27 00:53:30,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1690692.0, ans=0.04949747468305833 2023-06-27 00:54:06,779 INFO [train.py:996] (2/4) Epoch 10, batch 7350, loss[loss=0.2529, simple_loss=0.3201, pruned_loss=0.0928, over 21582.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2722, pruned_loss=0.0647, over 4267127.43 frames. ], batch size: 389, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:54:12,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1690812.0, ans=0.125 2023-06-27 00:54:15,740 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.980e+02 5.910e+02 7.871e+02 1.338e+03 3.655e+03, threshold=1.574e+03, percent-clipped=15.0 2023-06-27 00:55:01,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1690932.0, ans=0.0 2023-06-27 00:55:12,890 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.09 vs. limit=15.0 2023-06-27 00:55:29,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1690992.0, ans=0.125 2023-06-27 00:55:36,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1690992.0, ans=0.1 2023-06-27 00:55:56,538 INFO [train.py:996] (2/4) Epoch 10, batch 7400, loss[loss=0.2494, simple_loss=0.3191, pruned_loss=0.0898, over 21331.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.281, pruned_loss=0.06783, over 4273424.53 frames. 
], batch size: 159, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:56:09,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1691112.0, ans=0.125 2023-06-27 00:56:29,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1691172.0, ans=0.0 2023-06-27 00:56:41,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1691232.0, ans=0.2 2023-06-27 00:57:06,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1691292.0, ans=0.0 2023-06-27 00:57:06,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1691292.0, ans=0.125 2023-06-27 00:57:18,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1691292.0, ans=0.0 2023-06-27 00:57:42,490 INFO [train.py:996] (2/4) Epoch 10, batch 7450, loss[loss=0.2005, simple_loss=0.2707, pruned_loss=0.06519, over 21524.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2791, pruned_loss=0.0662, over 4271968.73 frames. ], batch size: 263, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:57:55,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1691412.0, ans=0.125 2023-06-27 00:57:56,772 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.982e+02 5.896e+02 9.357e+02 1.491e+03 2.777e+03, threshold=1.871e+03, percent-clipped=18.0 2023-06-27 00:58:19,568 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-27 00:58:40,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1691532.0, ans=0.125 2023-06-27 00:58:40,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1691532.0, ans=0.0 2023-06-27 00:59:11,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1691592.0, ans=0.125 2023-06-27 00:59:37,959 INFO [train.py:996] (2/4) Epoch 10, batch 7500, loss[loss=0.2338, simple_loss=0.3186, pruned_loss=0.07451, over 21433.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2833, pruned_loss=0.06702, over 4269845.48 frames. ], batch size: 131, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:59:40,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1691712.0, ans=0.1 2023-06-27 00:59:49,677 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.38 vs. 
limit=15.0 2023-06-27 01:00:01,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1691772.0, ans=0.0 2023-06-27 01:00:20,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1691772.0, ans=0.2 2023-06-27 01:00:37,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1691832.0, ans=0.125 2023-06-27 01:00:46,764 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 01:00:51,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1691892.0, ans=0.125 2023-06-27 01:00:55,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1691892.0, ans=0.125 2023-06-27 01:01:02,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1691952.0, ans=0.0 2023-06-27 01:01:31,419 INFO [train.py:996] (2/4) Epoch 10, batch 7550, loss[loss=0.1903, simple_loss=0.2841, pruned_loss=0.04824, over 21626.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2915, pruned_loss=0.06631, over 4276889.08 frames. ], batch size: 230, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 01:01:39,831 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.177e+02 6.369e+02 9.874e+02 1.839e+03 3.635e+03, threshold=1.975e+03, percent-clipped=22.0 2023-06-27 01:02:28,733 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.14 vs. limit=15.0 2023-06-27 01:02:57,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1692252.0, ans=0.125 2023-06-27 01:03:06,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1692252.0, ans=0.1 2023-06-27 01:03:11,967 INFO [train.py:996] (2/4) Epoch 10, batch 7600, loss[loss=0.1838, simple_loss=0.2523, pruned_loss=0.05761, over 16653.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2905, pruned_loss=0.06535, over 4280925.87 frames. ], batch size: 60, lr: 2.97e-03, grad_scale: 32.0 2023-06-27 01:03:35,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1692372.0, ans=0.125 2023-06-27 01:03:36,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1692372.0, ans=0.125 2023-06-27 01:04:21,605 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.48 vs. limit=12.0 2023-06-27 01:04:22,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1692492.0, ans=0.125 2023-06-27 01:05:03,914 INFO [train.py:996] (2/4) Epoch 10, batch 7650, loss[loss=0.2089, simple_loss=0.2795, pruned_loss=0.0692, over 21829.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2891, pruned_loss=0.06625, over 4283555.52 frames. 
], batch size: 247, lr: 2.97e-03, grad_scale: 32.0 2023-06-27 01:05:12,465 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.938e+02 5.695e+02 7.737e+02 9.992e+02 2.893e+03, threshold=1.547e+03, percent-clipped=4.0 2023-06-27 01:05:54,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1692732.0, ans=0.1 2023-06-27 01:06:05,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1692792.0, ans=0.2 2023-06-27 01:06:32,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1692852.0, ans=0.125 2023-06-27 01:06:51,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1692912.0, ans=0.125 2023-06-27 01:06:52,713 INFO [train.py:996] (2/4) Epoch 10, batch 7700, loss[loss=0.2669, simple_loss=0.3321, pruned_loss=0.1008, over 21794.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2914, pruned_loss=0.06909, over 4284773.67 frames. ], batch size: 441, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:07:12,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1692912.0, ans=0.125 2023-06-27 01:07:46,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1693032.0, ans=0.0 2023-06-27 01:07:47,690 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.50 vs. limit=8.0 2023-06-27 01:07:58,588 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=15.0 2023-06-27 01:08:01,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1693092.0, ans=0.0 2023-06-27 01:08:43,832 INFO [train.py:996] (2/4) Epoch 10, batch 7750, loss[loss=0.1514, simple_loss=0.2165, pruned_loss=0.04319, over 17208.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2953, pruned_loss=0.06907, over 4284033.59 frames. ], batch size: 62, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:09:05,028 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.135e+02 8.248e+02 1.279e+03 1.795e+03 4.947e+03, threshold=2.557e+03, percent-clipped=28.0 2023-06-27 01:09:47,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1693392.0, ans=0.0 2023-06-27 01:10:13,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1693452.0, ans=0.125 2023-06-27 01:10:14,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=1693452.0, ans=22.5 2023-06-27 01:10:15,973 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.67 vs. limit=22.5 2023-06-27 01:10:17,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1693452.0, ans=0.1 2023-06-27 01:10:23,205 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.42 vs. 
limit=22.5 2023-06-27 01:10:42,231 INFO [train.py:996] (2/4) Epoch 10, batch 7800, loss[loss=0.1897, simple_loss=0.2581, pruned_loss=0.06062, over 21473.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2979, pruned_loss=0.06978, over 4281850.01 frames. ], batch size: 195, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:11:01,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1693572.0, ans=0.125 2023-06-27 01:11:07,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1693572.0, ans=0.0 2023-06-27 01:11:17,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1693632.0, ans=0.2 2023-06-27 01:11:39,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1693692.0, ans=0.0 2023-06-27 01:11:44,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1693692.0, ans=0.0 2023-06-27 01:12:12,629 INFO [train.py:996] (2/4) Epoch 10, batch 7850, loss[loss=0.191, simple_loss=0.2571, pruned_loss=0.06247, over 21591.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2914, pruned_loss=0.06912, over 4281369.29 frames. ], batch size: 298, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:12:32,518 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.059e+02 5.917e+02 8.514e+02 1.468e+03 3.815e+03, threshold=1.703e+03, percent-clipped=5.0 2023-06-27 01:12:45,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1693872.0, ans=0.0 2023-06-27 01:12:47,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1693872.0, ans=0.0 2023-06-27 01:14:08,067 INFO [train.py:996] (2/4) Epoch 10, batch 7900, loss[loss=0.3017, simple_loss=0.3831, pruned_loss=0.1102, over 21445.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2883, pruned_loss=0.0692, over 4276818.04 frames. ], batch size: 507, lr: 2.97e-03, grad_scale: 8.0 2023-06-27 01:14:19,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1694112.0, ans=0.1 2023-06-27 01:14:26,552 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=15.0 2023-06-27 01:15:36,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1694352.0, ans=0.0 2023-06-27 01:16:04,759 INFO [train.py:996] (2/4) Epoch 10, batch 7950, loss[loss=0.2045, simple_loss=0.2892, pruned_loss=0.05993, over 21822.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2943, pruned_loss=0.0694, over 4274315.89 frames. 
], batch size: 282, lr: 2.97e-03, grad_scale: 8.0 2023-06-27 01:16:05,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1694412.0, ans=0.07 2023-06-27 01:16:12,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1694412.0, ans=0.07 2023-06-27 01:16:16,929 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.966e+02 5.576e+02 7.742e+02 1.234e+03 3.670e+03, threshold=1.548e+03, percent-clipped=16.0 2023-06-27 01:16:33,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1694472.0, ans=0.07 2023-06-27 01:17:50,248 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.71 vs. limit=10.0 2023-06-27 01:17:56,283 INFO [train.py:996] (2/4) Epoch 10, batch 8000, loss[loss=0.2246, simple_loss=0.3147, pruned_loss=0.06729, over 21867.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2986, pruned_loss=0.07083, over 4264878.68 frames. ], batch size: 316, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:19:09,778 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0 2023-06-27 01:19:21,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1694892.0, ans=0.125 2023-06-27 01:19:26,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1694892.0, ans=0.015 2023-06-27 01:20:02,339 INFO [train.py:996] (2/4) Epoch 10, batch 8050, loss[loss=0.1444, simple_loss=0.1893, pruned_loss=0.0497, over 16284.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3016, pruned_loss=0.07067, over 4260656.66 frames. ], batch size: 60, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:20:05,351 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.26 vs. limit=12.0 2023-06-27 01:20:10,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1695012.0, ans=0.125 2023-06-27 01:20:13,566 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 01:20:14,620 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.526e+02 7.082e+02 1.044e+03 1.392e+03 2.627e+03, threshold=2.088e+03, percent-clipped=20.0 2023-06-27 01:20:20,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1695072.0, ans=0.125 2023-06-27 01:20:56,720 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. 
limit=6.0 2023-06-27 01:21:14,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1695192.0, ans=0.125 2023-06-27 01:21:18,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1695192.0, ans=0.0 2023-06-27 01:21:34,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1695252.0, ans=0.0 2023-06-27 01:21:39,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1695252.0, ans=0.125 2023-06-27 01:21:46,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1695252.0, ans=0.0 2023-06-27 01:21:51,398 INFO [train.py:996] (2/4) Epoch 10, batch 8100, loss[loss=0.2176, simple_loss=0.2868, pruned_loss=0.07422, over 21834.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3014, pruned_loss=0.07198, over 4268864.38 frames. ], batch size: 282, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:21:52,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1695312.0, ans=10.0 2023-06-27 01:22:19,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1695372.0, ans=0.1 2023-06-27 01:23:50,280 INFO [train.py:996] (2/4) Epoch 10, batch 8150, loss[loss=0.2631, simple_loss=0.3708, pruned_loss=0.07771, over 21703.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3106, pruned_loss=0.07306, over 4270526.19 frames. ], batch size: 414, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:24:07,915 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.023e+02 5.816e+02 8.551e+02 1.587e+03 5.169e+03, threshold=1.710e+03, percent-clipped=17.0 2023-06-27 01:24:08,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=1695612.0, ans=0.05 2023-06-27 01:25:31,242 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.71 vs. limit=5.0 2023-06-27 01:25:38,349 INFO [train.py:996] (2/4) Epoch 10, batch 8200, loss[loss=0.1798, simple_loss=0.2444, pruned_loss=0.05766, over 21436.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3018, pruned_loss=0.07066, over 4268051.69 frames. ], batch size: 160, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:27:03,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1696152.0, ans=0.125 2023-06-27 01:27:32,707 INFO [train.py:996] (2/4) Epoch 10, batch 8250, loss[loss=0.1952, simple_loss=0.2984, pruned_loss=0.04599, over 21807.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.297, pruned_loss=0.06969, over 4261995.62 frames. 
], batch size: 316, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:27:38,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1696212.0, ans=0.125 2023-06-27 01:27:44,589 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.725e+02 5.485e+02 7.641e+02 1.335e+03 2.771e+03, threshold=1.528e+03, percent-clipped=11.0 2023-06-27 01:29:13,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1696452.0, ans=0.125 2023-06-27 01:29:21,575 INFO [train.py:996] (2/4) Epoch 10, batch 8300, loss[loss=0.1909, simple_loss=0.2728, pruned_loss=0.05449, over 21335.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2967, pruned_loss=0.06715, over 4264099.82 frames. ], batch size: 176, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:29:28,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1696512.0, ans=0.125 2023-06-27 01:30:36,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1696692.0, ans=0.035 2023-06-27 01:31:11,539 INFO [train.py:996] (2/4) Epoch 10, batch 8350, loss[loss=0.2262, simple_loss=0.3066, pruned_loss=0.07296, over 20744.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2958, pruned_loss=0.06524, over 4262277.44 frames. ], batch size: 607, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:31:23,487 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.611e+02 5.774e+02 7.528e+02 1.140e+03 3.100e+03, threshold=1.506e+03, percent-clipped=11.0 2023-06-27 01:31:33,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1696872.0, ans=0.125 2023-06-27 01:32:15,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1696992.0, ans=0.2 2023-06-27 01:32:30,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1696992.0, ans=0.025 2023-06-27 01:33:01,183 INFO [train.py:996] (2/4) Epoch 10, batch 8400, loss[loss=0.1633, simple_loss=0.2499, pruned_loss=0.03835, over 21160.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2917, pruned_loss=0.06278, over 4265218.70 frames. ], batch size: 176, lr: 2.97e-03, grad_scale: 32.0 2023-06-27 01:33:52,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1697232.0, ans=0.0 2023-06-27 01:33:57,714 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 01:34:31,474 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.88 vs. limit=10.0 2023-06-27 01:34:42,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1697352.0, ans=0.125 2023-06-27 01:34:45,876 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 01:34:48,813 INFO [train.py:996] (2/4) Epoch 10, batch 8450, loss[loss=0.1955, simple_loss=0.2681, pruned_loss=0.06147, over 21858.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2904, pruned_loss=0.06169, over 4268667.49 frames. 
], batch size: 282, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:35:02,441 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.213e+02 7.215e+02 1.072e+03 1.642e+03 3.949e+03, threshold=2.143e+03, percent-clipped=30.0 2023-06-27 01:35:11,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1697472.0, ans=0.2 2023-06-27 01:35:47,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1697532.0, ans=0.2 2023-06-27 01:35:50,595 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-06-27 01:35:51,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1697592.0, ans=0.5 2023-06-27 01:36:38,004 INFO [train.py:996] (2/4) Epoch 10, batch 8500, loss[loss=0.2408, simple_loss=0.2901, pruned_loss=0.09572, over 21385.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.288, pruned_loss=0.06376, over 4272533.86 frames. ], batch size: 508, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:37:20,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1697772.0, ans=0.0 2023-06-27 01:37:23,048 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.35 vs. limit=15.0 2023-06-27 01:37:31,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1697832.0, ans=0.125 2023-06-27 01:37:31,850 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.44 vs. limit=6.0 2023-06-27 01:38:04,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1697952.0, ans=0.035 2023-06-27 01:38:28,161 INFO [train.py:996] (2/4) Epoch 10, batch 8550, loss[loss=0.2211, simple_loss=0.3045, pruned_loss=0.06888, over 21447.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2926, pruned_loss=0.06592, over 4272775.91 frames. ], batch size: 194, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:38:41,907 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.182e+02 6.171e+02 1.011e+03 1.607e+03 3.555e+03, threshold=2.023e+03, percent-clipped=12.0 2023-06-27 01:38:53,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1698072.0, ans=0.125 2023-06-27 01:39:24,163 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.15 vs. limit=10.0 2023-06-27 01:40:17,128 INFO [train.py:996] (2/4) Epoch 10, batch 8600, loss[loss=0.2363, simple_loss=0.3158, pruned_loss=0.07838, over 21687.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2985, pruned_loss=0.06822, over 4266834.03 frames. 
], batch size: 351, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:40:27,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1698312.0, ans=0.125 2023-06-27 01:40:59,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1698372.0, ans=0.05 2023-06-27 01:41:17,129 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.65 vs. limit=15.0 2023-06-27 01:41:54,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1698552.0, ans=0.0 2023-06-27 01:42:05,314 INFO [train.py:996] (2/4) Epoch 10, batch 8650, loss[loss=0.1622, simple_loss=0.2587, pruned_loss=0.0328, over 21652.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3038, pruned_loss=0.06884, over 4277533.42 frames. ], batch size: 230, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:42:16,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1698612.0, ans=0.125 2023-06-27 01:42:24,802 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.451e+02 5.765e+02 7.630e+02 1.183e+03 2.009e+03, threshold=1.526e+03, percent-clipped=0.0 2023-06-27 01:43:11,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1698792.0, ans=0.1 2023-06-27 01:43:18,793 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-27 01:43:50,347 INFO [train.py:996] (2/4) Epoch 10, batch 8700, loss[loss=0.1862, simple_loss=0.2356, pruned_loss=0.06844, over 20130.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2951, pruned_loss=0.06574, over 4270219.41 frames. ], batch size: 704, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:43:51,503 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-06-27 01:44:15,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1698972.0, ans=0.0 2023-06-27 01:45:06,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1699092.0, ans=0.125 2023-06-27 01:45:39,154 INFO [train.py:996] (2/4) Epoch 10, batch 8750, loss[loss=0.2008, simple_loss=0.2692, pruned_loss=0.06619, over 21772.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2904, pruned_loss=0.06671, over 4273538.48 frames. ], batch size: 298, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:45:59,218 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.645e+02 6.087e+02 8.152e+02 1.140e+03 2.309e+03, threshold=1.630e+03, percent-clipped=11.0 2023-06-27 01:46:11,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1699272.0, ans=0.1 2023-06-27 01:46:27,712 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.20 vs. 
limit=15.0 2023-06-27 01:46:42,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1699332.0, ans=0.5 2023-06-27 01:46:45,191 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.75 vs. limit=6.0 2023-06-27 01:47:28,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1699452.0, ans=0.1 2023-06-27 01:47:34,999 INFO [train.py:996] (2/4) Epoch 10, batch 8800, loss[loss=0.2106, simple_loss=0.3144, pruned_loss=0.05337, over 19844.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2973, pruned_loss=0.06935, over 4274781.18 frames. ], batch size: 702, lr: 2.97e-03, grad_scale: 32.0 2023-06-27 01:48:04,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1699572.0, ans=0.2 2023-06-27 01:48:04,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1699572.0, ans=0.05 2023-06-27 01:49:33,065 INFO [train.py:996] (2/4) Epoch 10, batch 8850, loss[loss=0.2281, simple_loss=0.328, pruned_loss=0.06407, over 21340.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3042, pruned_loss=0.07153, over 4273272.26 frames. ], batch size: 159, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:49:48,559 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.063e+02 5.642e+02 7.591e+02 1.245e+03 2.739e+03, threshold=1.518e+03, percent-clipped=14.0 2023-06-27 01:50:06,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1699872.0, ans=0.125 2023-06-27 01:50:09,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1699872.0, ans=0.1 2023-06-27 01:51:22,894 INFO [train.py:996] (2/4) Epoch 10, batch 8900, loss[loss=0.2146, simple_loss=0.27, pruned_loss=0.07964, over 21268.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2976, pruned_loss=0.07055, over 4275424.35 frames. ], batch size: 471, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:52:41,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1700292.0, ans=0.125 2023-06-27 01:53:04,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1700352.0, ans=0.125 2023-06-27 01:53:20,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1700412.0, ans=0.2 2023-06-27 01:53:21,320 INFO [train.py:996] (2/4) Epoch 10, batch 8950, loss[loss=0.2318, simple_loss=0.3438, pruned_loss=0.05992, over 19814.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.3007, pruned_loss=0.06954, over 4271838.01 frames. 
], batch size: 702, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:53:22,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1700412.0, ans=0.0 2023-06-27 01:53:42,449 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.709e+02 6.064e+02 9.607e+02 1.976e+03 3.801e+03, threshold=1.921e+03, percent-clipped=34.0 2023-06-27 01:54:13,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1700532.0, ans=0.125 2023-06-27 01:54:13,895 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=22.5 2023-06-27 01:54:15,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1700532.0, ans=0.0 2023-06-27 01:54:18,909 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.04 vs. limit=15.0 2023-06-27 01:54:37,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1700592.0, ans=0.125 2023-06-27 01:55:09,701 INFO [train.py:996] (2/4) Epoch 10, batch 9000, loss[loss=0.1798, simple_loss=0.2536, pruned_loss=0.05301, over 21670.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2948, pruned_loss=0.06881, over 4276229.96 frames. ], batch size: 282, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:55:09,702 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-27 01:55:28,003 INFO [train.py:1028] (2/4) Epoch 10, validation: loss=0.2678, simple_loss=0.3533, pruned_loss=0.09113, over 1796401.00 frames. 2023-06-27 01:55:28,004 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-27 01:55:45,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1700712.0, ans=0.125 2023-06-27 01:56:14,322 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=22.5 2023-06-27 01:56:59,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1700952.0, ans=0.1 2023-06-27 01:57:06,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1700952.0, ans=0.125 2023-06-27 01:57:23,144 INFO [train.py:996] (2/4) Epoch 10, batch 9050, loss[loss=0.2151, simple_loss=0.2978, pruned_loss=0.06622, over 21676.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.289, pruned_loss=0.06616, over 4278936.84 frames. 
], batch size: 351, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:57:42,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1701012.0, ans=0.125 2023-06-27 01:57:45,759 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.665e+02 7.496e+02 1.289e+03 1.830e+03 3.310e+03, threshold=2.578e+03, percent-clipped=22.0 2023-06-27 01:58:39,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1701192.0, ans=0.125 2023-06-27 01:58:44,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1701192.0, ans=0.2 2023-06-27 01:58:46,982 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=22.5 2023-06-27 01:58:48,929 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.89 vs. limit=12.0 2023-06-27 01:59:06,201 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.49 vs. limit=15.0 2023-06-27 01:59:13,818 INFO [train.py:996] (2/4) Epoch 10, batch 9100, loss[loss=0.2134, simple_loss=0.3083, pruned_loss=0.05926, over 21455.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2939, pruned_loss=0.06765, over 4279575.77 frames. ], batch size: 131, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:59:32,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1701312.0, ans=0.0 2023-06-27 02:00:06,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1701432.0, ans=0.1 2023-06-27 02:00:31,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1701492.0, ans=0.07 2023-06-27 02:01:09,282 INFO [train.py:996] (2/4) Epoch 10, batch 9150, loss[loss=0.2477, simple_loss=0.3407, pruned_loss=0.07739, over 21651.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2991, pruned_loss=0.06593, over 4279531.47 frames. ], batch size: 389, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 02:01:23,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1701612.0, ans=0.1 2023-06-27 02:01:24,806 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.482e+02 5.209e+02 7.364e+02 1.147e+03 3.350e+03, threshold=1.473e+03, percent-clipped=3.0 2023-06-27 02:02:59,006 INFO [train.py:996] (2/4) Epoch 10, batch 9200, loss[loss=0.2565, simple_loss=0.3362, pruned_loss=0.08838, over 21472.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.3004, pruned_loss=0.0652, over 4272415.74 frames. ], batch size: 507, lr: 2.97e-03, grad_scale: 32.0 2023-06-27 02:03:02,032 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.88 vs. limit=10.0 2023-06-27 02:03:17,370 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.11 vs. 
limit=15.0 2023-06-27 02:04:30,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1702152.0, ans=0.2 2023-06-27 02:04:45,314 INFO [train.py:996] (2/4) Epoch 10, batch 9250, loss[loss=0.2059, simple_loss=0.2711, pruned_loss=0.07031, over 21191.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.3021, pruned_loss=0.06853, over 4280658.00 frames. ], batch size: 143, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 02:05:02,709 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.922e+02 6.299e+02 8.423e+02 1.393e+03 3.022e+03, threshold=1.685e+03, percent-clipped=24.0 2023-06-27 02:05:43,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1702332.0, ans=0.1 2023-06-27 02:06:35,091 INFO [train.py:996] (2/4) Epoch 10, batch 9300, loss[loss=0.1958, simple_loss=0.2685, pruned_loss=0.06154, over 21242.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2963, pruned_loss=0.06814, over 4281343.01 frames. ], batch size: 159, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 02:07:17,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1702632.0, ans=0.0 2023-06-27 02:08:19,218 INFO [train.py:996] (2/4) Epoch 10, batch 9350, loss[loss=0.2588, simple_loss=0.3352, pruned_loss=0.09127, over 21793.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.3025, pruned_loss=0.06789, over 4285052.28 frames. ], batch size: 441, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 02:08:46,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1702872.0, ans=0.0 2023-06-27 02:08:47,198 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.895e+02 6.669e+02 9.528e+02 1.719e+03 4.361e+03, threshold=1.906e+03, percent-clipped=26.0 2023-06-27 02:08:58,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1702872.0, ans=0.0 2023-06-27 02:09:28,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1702932.0, ans=0.1 2023-06-27 02:09:45,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1702992.0, ans=0.1 2023-06-27 02:09:46,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1702992.0, ans=0.2 2023-06-27 02:10:18,904 INFO [train.py:996] (2/4) Epoch 10, batch 9400, loss[loss=0.2014, simple_loss=0.2635, pruned_loss=0.06967, over 21709.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.3032, pruned_loss=0.0683, over 4282727.60 frames. ], batch size: 112, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 02:10:20,477 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. 
limit=6.0 2023-06-27 02:11:06,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1703232.0, ans=0.1 2023-06-27 02:11:11,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1703232.0, ans=0.07 2023-06-27 02:11:21,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1703292.0, ans=0.125 2023-06-27 02:11:38,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1703352.0, ans=0.125 2023-06-27 02:11:59,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1703352.0, ans=0.1 2023-06-27 02:12:05,141 INFO [train.py:996] (2/4) Epoch 10, batch 9450, loss[loss=0.1741, simple_loss=0.2424, pruned_loss=0.05296, over 21669.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2951, pruned_loss=0.06764, over 4273738.76 frames. ], batch size: 282, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 02:12:14,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1703412.0, ans=0.125 2023-06-27 02:12:22,344 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.110e+02 5.502e+02 7.576e+02 1.129e+03 2.324e+03, threshold=1.515e+03, percent-clipped=5.0 2023-06-27 02:12:28,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1703472.0, ans=0.0 2023-06-27 02:13:22,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1703592.0, ans=0.125 2023-06-27 02:13:41,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1703652.0, ans=0.025 2023-06-27 02:13:52,555 INFO [train.py:996] (2/4) Epoch 10, batch 9500, loss[loss=0.1725, simple_loss=0.2638, pruned_loss=0.0406, over 21803.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2874, pruned_loss=0.06583, over 4270528.98 frames. ], batch size: 316, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:15:22,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1703952.0, ans=0.0 2023-06-27 02:15:32,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1703952.0, ans=0.1 2023-06-27 02:15:40,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1703952.0, ans=0.1 2023-06-27 02:15:42,682 INFO [train.py:996] (2/4) Epoch 10, batch 9550, loss[loss=0.2534, simple_loss=0.3308, pruned_loss=0.08803, over 21298.00 frames. ], tot_loss[loss=0.213, simple_loss=0.291, pruned_loss=0.06748, over 4271284.92 frames. ], batch size: 548, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:16:04,749 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.287e+02 6.617e+02 9.297e+02 1.429e+03 3.226e+03, threshold=1.859e+03, percent-clipped=22.0 2023-06-27 02:17:29,884 INFO [train.py:996] (2/4) Epoch 10, batch 9600, loss[loss=0.2034, simple_loss=0.2804, pruned_loss=0.06317, over 21946.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2929, pruned_loss=0.06824, over 4278000.65 frames. 
], batch size: 316, lr: 2.96e-03, grad_scale: 32.0 2023-06-27 02:17:36,507 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.86 vs. limit=22.5 2023-06-27 02:17:53,803 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=15.0 2023-06-27 02:18:01,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1704372.0, ans=0.0 2023-06-27 02:18:46,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1704492.0, ans=0.1 2023-06-27 02:19:09,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1704552.0, ans=0.05 2023-06-27 02:19:26,616 INFO [train.py:996] (2/4) Epoch 10, batch 9650, loss[loss=0.2515, simple_loss=0.3238, pruned_loss=0.08959, over 21789.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2936, pruned_loss=0.06864, over 4281229.57 frames. ], batch size: 441, lr: 2.96e-03, grad_scale: 32.0 2023-06-27 02:19:31,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1704612.0, ans=0.125 2023-06-27 02:19:33,226 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=15.0 2023-06-27 02:19:45,811 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.795e+02 6.257e+02 8.564e+02 1.301e+03 2.812e+03, threshold=1.713e+03, percent-clipped=7.0 2023-06-27 02:19:52,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1704672.0, ans=0.1 2023-06-27 02:20:26,242 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=15.0 2023-06-27 02:20:49,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1704792.0, ans=0.125 2023-06-27 02:21:15,568 INFO [train.py:996] (2/4) Epoch 10, batch 9700, loss[loss=0.2123, simple_loss=0.2912, pruned_loss=0.06666, over 21825.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2989, pruned_loss=0.07066, over 4287162.57 frames. ], batch size: 371, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:21:19,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1704912.0, ans=0.125 2023-06-27 02:23:03,789 INFO [train.py:996] (2/4) Epoch 10, batch 9750, loss[loss=0.1785, simple_loss=0.2462, pruned_loss=0.05538, over 21518.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2938, pruned_loss=0.06924, over 4273857.15 frames. 
], batch size: 230, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:23:07,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1705212.0, ans=0.0 2023-06-27 02:23:27,951 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.191e+02 6.700e+02 1.068e+03 1.546e+03 3.673e+03, threshold=2.135e+03, percent-clipped=19.0 2023-06-27 02:23:28,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1705272.0, ans=0.125 2023-06-27 02:23:47,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1705332.0, ans=0.2 2023-06-27 02:24:30,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1705452.0, ans=0.1 2023-06-27 02:24:37,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1705452.0, ans=0.125 2023-06-27 02:24:42,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1705452.0, ans=0.125 2023-06-27 02:24:44,853 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.20 vs. limit=15.0 2023-06-27 02:24:45,105 INFO [train.py:996] (2/4) Epoch 10, batch 9800, loss[loss=0.2157, simple_loss=0.2972, pruned_loss=0.06712, over 21804.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2909, pruned_loss=0.0692, over 4269187.79 frames. ], batch size: 351, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:25:24,131 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=22.5 2023-06-27 02:26:23,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1705752.0, ans=10.0 2023-06-27 02:26:30,540 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.46 vs. limit=15.0 2023-06-27 02:26:38,267 INFO [train.py:996] (2/4) Epoch 10, batch 9850, loss[loss=0.199, simple_loss=0.2739, pruned_loss=0.06201, over 21841.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2892, pruned_loss=0.06896, over 4272627.47 frames. ], batch size: 107, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:26:39,330 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 02:26:43,027 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.07 vs. limit=22.5 2023-06-27 02:26:47,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1705812.0, ans=0.0 2023-06-27 02:27:02,355 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.847e+02 5.295e+02 7.367e+02 1.134e+03 2.701e+03, threshold=1.473e+03, percent-clipped=3.0 2023-06-27 02:27:13,067 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.94 vs. 
limit=15.0 2023-06-27 02:27:17,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1705932.0, ans=0.1 2023-06-27 02:27:21,080 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.93 vs. limit=15.0 2023-06-27 02:27:50,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1705992.0, ans=0.125 2023-06-27 02:28:11,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1706052.0, ans=0.1 2023-06-27 02:28:26,499 INFO [train.py:996] (2/4) Epoch 10, batch 9900, loss[loss=0.2321, simple_loss=0.3247, pruned_loss=0.06977, over 19969.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2858, pruned_loss=0.06864, over 4269343.41 frames. ], batch size: 703, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:29:09,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1706232.0, ans=0.125 2023-06-27 02:29:25,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1706292.0, ans=0.125 2023-06-27 02:29:48,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1706292.0, ans=0.125 2023-06-27 02:29:57,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1706352.0, ans=0.5 2023-06-27 02:30:15,240 INFO [train.py:996] (2/4) Epoch 10, batch 9950, loss[loss=0.1834, simple_loss=0.245, pruned_loss=0.06093, over 21541.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2861, pruned_loss=0.06975, over 4266594.51 frames. ], batch size: 230, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:30:21,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1706412.0, ans=0.125 2023-06-27 02:30:38,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1706472.0, ans=0.125 2023-06-27 02:30:39,761 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.384e+02 6.546e+02 9.078e+02 1.320e+03 2.583e+03, threshold=1.816e+03, percent-clipped=18.0 2023-06-27 02:30:49,319 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.94 vs. limit=10.0 2023-06-27 02:31:25,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1706592.0, ans=0.125 2023-06-27 02:31:59,319 INFO [train.py:996] (2/4) Epoch 10, batch 10000, loss[loss=0.1878, simple_loss=0.2461, pruned_loss=0.06476, over 19961.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2823, pruned_loss=0.06884, over 4266898.25 frames. 
], batch size: 703, lr: 2.96e-03, grad_scale: 32.0 2023-06-27 02:32:28,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1706772.0, ans=0.2 2023-06-27 02:33:46,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1706952.0, ans=0.125 2023-06-27 02:33:57,189 INFO [train.py:996] (2/4) Epoch 10, batch 10050, loss[loss=0.2664, simple_loss=0.3764, pruned_loss=0.07818, over 19889.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2851, pruned_loss=0.06931, over 4268579.82 frames. ], batch size: 703, lr: 2.96e-03, grad_scale: 32.0 2023-06-27 02:34:06,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=1707012.0, ans=0.2 2023-06-27 02:34:16,296 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.082e+02 5.853e+02 8.209e+02 1.305e+03 2.955e+03, threshold=1.642e+03, percent-clipped=12.0 2023-06-27 02:35:45,597 INFO [train.py:996] (2/4) Epoch 10, batch 10100, loss[loss=0.2084, simple_loss=0.2791, pruned_loss=0.06889, over 21705.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2823, pruned_loss=0.06756, over 4273249.05 frames. ], batch size: 247, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:35:58,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1707312.0, ans=0.0 2023-06-27 02:36:08,228 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-27 02:37:33,994 INFO [train.py:996] (2/4) Epoch 10, batch 10150, loss[loss=0.2002, simple_loss=0.2671, pruned_loss=0.06663, over 21830.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2881, pruned_loss=0.07028, over 4275289.23 frames. ], batch size: 107, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 02:38:02,113 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.860e+02 5.691e+02 7.969e+02 1.243e+03 2.132e+03, threshold=1.594e+03, percent-clipped=9.0 2023-06-27 02:38:22,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1707732.0, ans=0.125 2023-06-27 02:38:45,129 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.62 vs. limit=15.0 2023-06-27 02:39:14,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1707852.0, ans=0.125 2023-06-27 02:39:22,095 INFO [train.py:996] (2/4) Epoch 10, batch 10200, loss[loss=0.1867, simple_loss=0.2669, pruned_loss=0.05326, over 20778.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2898, pruned_loss=0.06888, over 4266394.44 frames. ], batch size: 607, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 02:40:20,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1708032.0, ans=0.0 2023-06-27 02:40:22,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1708032.0, ans=0.125 2023-06-27 02:40:45,429 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.40 vs. 
limit=15.0 2023-06-27 02:40:56,507 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.37 vs. limit=15.0 2023-06-27 02:41:10,238 INFO [train.py:996] (2/4) Epoch 10, batch 10250, loss[loss=0.1827, simple_loss=0.275, pruned_loss=0.04519, over 21451.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2866, pruned_loss=0.06427, over 4264467.24 frames. ], batch size: 471, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 02:41:19,756 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 02:41:44,130 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.003e+02 5.121e+02 6.832e+02 1.019e+03 2.987e+03, threshold=1.366e+03, percent-clipped=4.0 2023-06-27 02:41:48,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1708272.0, ans=0.2 2023-06-27 02:41:52,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1708272.0, ans=0.125 2023-06-27 02:42:46,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1708452.0, ans=0.125 2023-06-27 02:43:03,420 INFO [train.py:996] (2/4) Epoch 10, batch 10300, loss[loss=0.1739, simple_loss=0.2576, pruned_loss=0.04509, over 20767.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.288, pruned_loss=0.06513, over 4260693.08 frames. ], batch size: 607, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 02:43:30,487 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=15.0 2023-06-27 02:43:50,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1708572.0, ans=0.125 2023-06-27 02:43:52,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1708572.0, ans=0.0 2023-06-27 02:43:54,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1708632.0, ans=0.0 2023-06-27 02:44:32,940 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.88 vs. limit=15.0 2023-06-27 02:44:54,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1708752.0, ans=0.125 2023-06-27 02:45:06,408 INFO [train.py:996] (2/4) Epoch 10, batch 10350, loss[loss=0.1756, simple_loss=0.237, pruned_loss=0.05708, over 21205.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2893, pruned_loss=0.06519, over 4255141.89 frames. ], batch size: 159, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 02:45:35,570 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.505e+02 7.876e+02 1.206e+03 1.704e+03 3.503e+03, threshold=2.411e+03, percent-clipped=40.0 2023-06-27 02:45:38,656 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.48 vs. 
limit=22.5 2023-06-27 02:46:04,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1708932.0, ans=0.1 2023-06-27 02:46:57,629 INFO [train.py:996] (2/4) Epoch 10, batch 10400, loss[loss=0.1628, simple_loss=0.218, pruned_loss=0.0538, over 21264.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2859, pruned_loss=0.06443, over 4259429.17 frames. ], batch size: 159, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:47:46,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1709232.0, ans=0.125 2023-06-27 02:47:53,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1709232.0, ans=0.125 2023-06-27 02:48:34,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1709352.0, ans=0.0 2023-06-27 02:48:52,964 INFO [train.py:996] (2/4) Epoch 10, batch 10450, loss[loss=0.2347, simple_loss=0.3243, pruned_loss=0.07255, over 20676.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2895, pruned_loss=0.06604, over 4265095.89 frames. ], batch size: 607, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:49:06,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1709412.0, ans=0.1 2023-06-27 02:49:21,569 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.083e+02 7.279e+02 1.026e+03 1.542e+03 3.103e+03, threshold=2.052e+03, percent-clipped=9.0 2023-06-27 02:49:48,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1709532.0, ans=0.125 2023-06-27 02:50:41,350 INFO [train.py:996] (2/4) Epoch 10, batch 10500, loss[loss=0.2179, simple_loss=0.2907, pruned_loss=0.07261, over 21833.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2902, pruned_loss=0.06486, over 4255915.14 frames. ], batch size: 98, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:51:08,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1709772.0, ans=0.2 2023-06-27 02:51:08,712 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=22.5 2023-06-27 02:51:32,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1709832.0, ans=0.1 2023-06-27 02:52:28,686 INFO [train.py:996] (2/4) Epoch 10, batch 10550, loss[loss=0.1851, simple_loss=0.2562, pruned_loss=0.057, over 21618.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2849, pruned_loss=0.0641, over 4256313.62 frames. ], batch size: 298, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:52:55,947 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.583e+02 5.517e+02 8.817e+02 1.298e+03 2.428e+03, threshold=1.763e+03, percent-clipped=4.0 2023-06-27 02:53:10,297 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=22.5 2023-06-27 02:54:16,530 INFO [train.py:996] (2/4) Epoch 10, batch 10600, loss[loss=0.1863, simple_loss=0.2739, pruned_loss=0.04939, over 21668.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2797, pruned_loss=0.06329, over 4258290.60 frames. 
], batch size: 298, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:54:19,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1710312.0, ans=0.125 2023-06-27 02:54:22,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1710312.0, ans=0.1 2023-06-27 02:54:28,857 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.16 vs. limit=22.5 2023-06-27 02:54:50,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1710372.0, ans=0.0 2023-06-27 02:54:52,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1710372.0, ans=0.125 2023-06-27 02:54:59,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1710432.0, ans=0.2 2023-06-27 02:55:17,537 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=15.0 2023-06-27 02:55:50,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1710552.0, ans=0.04949747468305833 2023-06-27 02:56:13,027 INFO [train.py:996] (2/4) Epoch 10, batch 10650, loss[loss=0.1791, simple_loss=0.2388, pruned_loss=0.05965, over 21928.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2851, pruned_loss=0.06312, over 4261284.85 frames. ], batch size: 98, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:56:35,988 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.041e+02 6.303e+02 9.847e+02 1.673e+03 3.050e+03, threshold=1.969e+03, percent-clipped=22.0 2023-06-27 02:57:11,837 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.20 vs. limit=15.0 2023-06-27 02:57:36,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1710852.0, ans=0.07 2023-06-27 02:57:54,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1710852.0, ans=0.0 2023-06-27 02:57:58,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1710852.0, ans=10.0 2023-06-27 02:58:01,506 INFO [train.py:996] (2/4) Epoch 10, batch 10700, loss[loss=0.2285, simple_loss=0.3034, pruned_loss=0.07675, over 21654.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2833, pruned_loss=0.06259, over 4251402.82 frames. ], batch size: 351, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 02:58:13,443 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.31 vs. 
limit=15.0 2023-06-27 02:58:21,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1710972.0, ans=0.125 2023-06-27 02:58:35,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1710972.0, ans=0.0 2023-06-27 02:59:14,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1711092.0, ans=0.0 2023-06-27 02:59:16,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1711092.0, ans=0.0 2023-06-27 02:59:51,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1711212.0, ans=0.125 2023-06-27 02:59:51,987 INFO [train.py:996] (2/4) Epoch 10, batch 10750, loss[loss=0.2577, simple_loss=0.3502, pruned_loss=0.08263, over 21655.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2944, pruned_loss=0.067, over 4260216.57 frames. ], batch size: 414, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 03:00:06,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1711212.0, ans=0.125 2023-06-27 03:00:20,704 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.08 vs. limit=15.0 2023-06-27 03:00:21,242 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.422e+02 6.069e+02 8.010e+02 1.380e+03 3.013e+03, threshold=1.602e+03, percent-clipped=10.0 2023-06-27 03:01:16,427 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.51 vs. limit=22.5 2023-06-27 03:01:27,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1711452.0, ans=0.0 2023-06-27 03:01:41,473 INFO [train.py:996] (2/4) Epoch 10, batch 10800, loss[loss=0.2271, simple_loss=0.3083, pruned_loss=0.07292, over 21725.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2981, pruned_loss=0.06786, over 4256795.10 frames. ], batch size: 298, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 03:02:11,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1711572.0, ans=0.125 2023-06-27 03:02:13,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1711572.0, ans=0.1 2023-06-27 03:02:14,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1711572.0, ans=0.0 2023-06-27 03:02:18,650 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.88 vs. limit=6.0 2023-06-27 03:02:22,384 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.29 vs. limit=15.0 2023-06-27 03:02:37,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1711632.0, ans=0.0 2023-06-27 03:02:52,226 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.05 vs. 
limit=10.0 2023-06-27 03:03:30,070 INFO [train.py:996] (2/4) Epoch 10, batch 10850, loss[loss=0.2167, simple_loss=0.2891, pruned_loss=0.0722, over 21584.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.297, pruned_loss=0.06855, over 4260455.69 frames. ], batch size: 441, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 03:04:05,455 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.119e+02 5.251e+02 7.747e+02 1.275e+03 2.663e+03, threshold=1.549e+03, percent-clipped=11.0 2023-06-27 03:04:27,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1711932.0, ans=0.1 2023-06-27 03:05:17,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1712052.0, ans=0.125 2023-06-27 03:05:21,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1712052.0, ans=0.2 2023-06-27 03:05:23,780 INFO [train.py:996] (2/4) Epoch 10, batch 10900, loss[loss=0.203, simple_loss=0.2965, pruned_loss=0.05478, over 21736.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2908, pruned_loss=0.06625, over 4262401.38 frames. ], batch size: 351, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 03:05:38,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1712112.0, ans=0.125 2023-06-27 03:06:01,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1712172.0, ans=0.0 2023-06-27 03:06:20,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1712232.0, ans=0.1 2023-06-27 03:06:31,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1712292.0, ans=0.035 2023-06-27 03:06:45,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1712292.0, ans=0.0 2023-06-27 03:07:12,335 INFO [train.py:996] (2/4) Epoch 10, batch 10950, loss[loss=0.1735, simple_loss=0.2476, pruned_loss=0.04975, over 21496.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2865, pruned_loss=0.06465, over 4264724.20 frames. ], batch size: 132, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 03:07:25,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1712412.0, ans=0.125 2023-06-27 03:07:48,606 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.904e+02 6.171e+02 9.007e+02 1.291e+03 2.424e+03, threshold=1.801e+03, percent-clipped=14.0 2023-06-27 03:08:06,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1712532.0, ans=0.0 2023-06-27 03:08:38,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1712652.0, ans=0.2 2023-06-27 03:08:51,782 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.87 vs. limit=22.5 2023-06-27 03:08:58,756 INFO [train.py:996] (2/4) Epoch 10, batch 11000, loss[loss=0.2302, simple_loss=0.3018, pruned_loss=0.07932, over 21936.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2851, pruned_loss=0.06564, over 4272858.19 frames. 
], batch size: 415, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 03:09:00,071 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.45 vs. limit=22.5 2023-06-27 03:09:08,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1712712.0, ans=0.125 2023-06-27 03:09:15,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1712712.0, ans=0.035 2023-06-27 03:09:34,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1712772.0, ans=0.025 2023-06-27 03:10:35,978 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=22.5 2023-06-27 03:10:46,685 INFO [train.py:996] (2/4) Epoch 10, batch 11050, loss[loss=0.1769, simple_loss=0.2231, pruned_loss=0.06535, over 20075.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2828, pruned_loss=0.06644, over 4275766.80 frames. ], batch size: 704, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 03:10:47,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1713012.0, ans=0.125 2023-06-27 03:11:22,037 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.001e+02 5.814e+02 8.503e+02 1.206e+03 2.810e+03, threshold=1.701e+03, percent-clipped=7.0 2023-06-27 03:11:34,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1713132.0, ans=0.125 2023-06-27 03:11:47,099 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.25 vs. limit=12.0 2023-06-27 03:11:57,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1713192.0, ans=0.2 2023-06-27 03:12:00,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1713192.0, ans=0.125 2023-06-27 03:12:10,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1713252.0, ans=0.0 2023-06-27 03:12:33,217 INFO [train.py:996] (2/4) Epoch 10, batch 11100, loss[loss=0.1848, simple_loss=0.2622, pruned_loss=0.05371, over 21500.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2819, pruned_loss=0.06648, over 4274496.20 frames. ], batch size: 230, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 03:12:53,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1713312.0, ans=0.07 2023-06-27 03:13:01,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1713372.0, ans=0.0 2023-06-27 03:13:13,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1713372.0, ans=0.0 2023-06-27 03:14:12,476 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-27 03:14:22,326 INFO [train.py:996] (2/4) Epoch 10, batch 11150, loss[loss=0.278, simple_loss=0.3398, pruned_loss=0.1081, over 21405.00 frames. 
], tot_loss[loss=0.2058, simple_loss=0.2798, pruned_loss=0.06596, over 4266387.69 frames. ], batch size: 507, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 03:14:38,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1713612.0, ans=0.0 2023-06-27 03:14:43,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1713672.0, ans=0.0 2023-06-27 03:14:58,533 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.769e+02 5.768e+02 8.894e+02 1.400e+03 2.503e+03, threshold=1.779e+03, percent-clipped=10.0 2023-06-27 03:15:29,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1713792.0, ans=0.015 2023-06-27 03:15:34,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1713792.0, ans=0.125 2023-06-27 03:16:02,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1713852.0, ans=0.2 2023-06-27 03:16:08,631 INFO [train.py:996] (2/4) Epoch 10, batch 11200, loss[loss=0.1831, simple_loss=0.248, pruned_loss=0.05911, over 21404.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2786, pruned_loss=0.0652, over 4262965.74 frames. ], batch size: 194, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 03:17:01,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1714032.0, ans=0.125 2023-06-27 03:17:39,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1714152.0, ans=0.125 2023-06-27 03:17:44,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1714152.0, ans=0.125 2023-06-27 03:17:54,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1714212.0, ans=10.0 2023-06-27 03:17:55,882 INFO [train.py:996] (2/4) Epoch 10, batch 11250, loss[loss=0.1996, simple_loss=0.2826, pruned_loss=0.05828, over 21825.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2778, pruned_loss=0.06514, over 4257795.10 frames. ], batch size: 316, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 03:18:26,749 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.073e+02 5.382e+02 8.145e+02 1.130e+03 2.477e+03, threshold=1.629e+03, percent-clipped=7.0 2023-06-27 03:19:38,949 INFO [train.py:996] (2/4) Epoch 10, batch 11300, loss[loss=0.2182, simple_loss=0.3111, pruned_loss=0.06262, over 19969.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2793, pruned_loss=0.06467, over 4265808.20 frames. 
], batch size: 702, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 03:19:43,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1714512.0, ans=0.2 2023-06-27 03:20:31,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1714632.0, ans=0.0 2023-06-27 03:20:32,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1714632.0, ans=0.0 2023-06-27 03:21:02,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1714692.0, ans=0.125 2023-06-27 03:21:22,965 INFO [train.py:996] (2/4) Epoch 10, batch 11350, loss[loss=0.2162, simple_loss=0.3094, pruned_loss=0.06146, over 21284.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2805, pruned_loss=0.0645, over 4269309.42 frames. ], batch size: 548, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 03:21:39,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1714812.0, ans=0.125 2023-06-27 03:22:00,034 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.082e+02 5.912e+02 7.672e+02 1.183e+03 2.053e+03, threshold=1.534e+03, percent-clipped=10.0 2023-06-27 03:22:05,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1714872.0, ans=0.0 2023-06-27 03:22:44,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1714992.0, ans=0.125 2023-06-27 03:23:08,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1715052.0, ans=0.0 2023-06-27 03:23:12,828 INFO [train.py:996] (2/4) Epoch 10, batch 11400, loss[loss=0.2148, simple_loss=0.3085, pruned_loss=0.06049, over 21693.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2863, pruned_loss=0.06706, over 4276350.04 frames. ], batch size: 332, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:23:34,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1715112.0, ans=0.125 2023-06-27 03:25:07,601 INFO [train.py:996] (2/4) Epoch 10, batch 11450, loss[loss=0.216, simple_loss=0.2937, pruned_loss=0.06919, over 21785.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2886, pruned_loss=0.06644, over 4275416.27 frames. 
], batch size: 247, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:25:10,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1715412.0, ans=0.1 2023-06-27 03:25:22,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1715412.0, ans=10.0 2023-06-27 03:25:33,683 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.931e+02 7.490e+02 1.068e+03 1.427e+03 2.700e+03, threshold=2.136e+03, percent-clipped=19.0 2023-06-27 03:25:39,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1715472.0, ans=0.125 2023-06-27 03:25:44,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1715532.0, ans=0.125 2023-06-27 03:26:17,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1715592.0, ans=0.125 2023-06-27 03:26:28,618 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 03:26:50,427 INFO [train.py:996] (2/4) Epoch 10, batch 11500, loss[loss=0.1798, simple_loss=0.2761, pruned_loss=0.04173, over 21720.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2917, pruned_loss=0.06785, over 4277755.55 frames. ], batch size: 247, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:27:01,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1715712.0, ans=0.2 2023-06-27 03:27:10,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1715712.0, ans=0.09899494936611666 2023-06-27 03:27:31,266 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.35 vs. limit=12.0 2023-06-27 03:28:02,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1715892.0, ans=0.95 2023-06-27 03:28:45,010 INFO [train.py:996] (2/4) Epoch 10, batch 11550, loss[loss=0.2389, simple_loss=0.3438, pruned_loss=0.067, over 21781.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2962, pruned_loss=0.0678, over 4271731.15 frames. ], batch size: 332, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:28:58,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1716012.0, ans=0.0 2023-06-27 03:29:09,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1716072.0, ans=0.1 2023-06-27 03:29:17,158 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.702e+02 7.297e+02 1.033e+03 1.557e+03 3.418e+03, threshold=2.066e+03, percent-clipped=10.0 2023-06-27 03:29:21,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1716072.0, ans=0.2 2023-06-27 03:30:32,992 INFO [train.py:996] (2/4) Epoch 10, batch 11600, loss[loss=0.2302, simple_loss=0.3315, pruned_loss=0.0644, over 21711.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3106, pruned_loss=0.07002, over 4274315.11 frames. 
], batch size: 247, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 03:30:47,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1716312.0, ans=0.125 2023-06-27 03:31:33,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1716432.0, ans=0.2 2023-06-27 03:32:07,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1716552.0, ans=0.0 2023-06-27 03:32:20,570 INFO [train.py:996] (2/4) Epoch 10, batch 11650, loss[loss=0.262, simple_loss=0.3285, pruned_loss=0.09777, over 21516.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3178, pruned_loss=0.07163, over 4269077.83 frames. ], batch size: 441, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:32:52,968 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.091e+02 7.350e+02 9.956e+02 1.670e+03 3.528e+03, threshold=1.991e+03, percent-clipped=18.0 2023-06-27 03:33:04,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1716732.0, ans=0.125 2023-06-27 03:33:08,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1716732.0, ans=15.0 2023-06-27 03:33:09,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1716732.0, ans=0.125 2023-06-27 03:33:22,289 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 03:34:07,078 INFO [train.py:996] (2/4) Epoch 10, batch 11700, loss[loss=0.1981, simple_loss=0.2619, pruned_loss=0.06711, over 21483.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3094, pruned_loss=0.07098, over 4276492.48 frames. ], batch size: 195, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:34:14,858 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-27 03:34:57,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1717032.0, ans=0.0 2023-06-27 03:34:59,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1717032.0, ans=0.025 2023-06-27 03:34:59,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1717032.0, ans=0.125 2023-06-27 03:35:31,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1717092.0, ans=0.125 2023-06-27 03:35:53,412 INFO [train.py:996] (2/4) Epoch 10, batch 11750, loss[loss=0.2213, simple_loss=0.307, pruned_loss=0.06774, over 21452.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3004, pruned_loss=0.07045, over 4278523.61 frames. 
], batch size: 131, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:35:54,172 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 03:36:26,196 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.050e+02 5.774e+02 7.571e+02 1.065e+03 1.774e+03, threshold=1.514e+03, percent-clipped=0.0 2023-06-27 03:37:00,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1717392.0, ans=0.0 2023-06-27 03:37:32,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1717452.0, ans=0.1 2023-06-27 03:37:41,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1717512.0, ans=0.125 2023-06-27 03:37:42,105 INFO [train.py:996] (2/4) Epoch 10, batch 11800, loss[loss=0.2371, simple_loss=0.3141, pruned_loss=0.08003, over 21538.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3005, pruned_loss=0.07105, over 4280088.46 frames. ], batch size: 131, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:37:49,133 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.12 vs. limit=22.5 2023-06-27 03:38:04,286 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 03:39:30,353 INFO [train.py:996] (2/4) Epoch 10, batch 11850, loss[loss=0.2501, simple_loss=0.3349, pruned_loss=0.08262, over 21561.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.3016, pruned_loss=0.07038, over 4280958.83 frames. ], batch size: 471, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:39:51,393 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.16 vs. limit=15.0 2023-06-27 03:40:09,304 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.078e+02 6.779e+02 9.644e+02 1.423e+03 2.292e+03, threshold=1.929e+03, percent-clipped=21.0 2023-06-27 03:40:28,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1717932.0, ans=0.2 2023-06-27 03:40:57,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1717992.0, ans=0.125 2023-06-27 03:41:25,955 INFO [train.py:996] (2/4) Epoch 10, batch 11900, loss[loss=0.1788, simple_loss=0.2526, pruned_loss=0.05251, over 21738.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.3036, pruned_loss=0.06834, over 4273099.91 frames. ], batch size: 124, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:42:31,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1718232.0, ans=0.125 2023-06-27 03:42:40,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1718292.0, ans=0.125 2023-06-27 03:42:48,696 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.95 vs. limit=22.5 2023-06-27 03:43:15,208 INFO [train.py:996] (2/4) Epoch 10, batch 11950, loss[loss=0.1898, simple_loss=0.2841, pruned_loss=0.0478, over 21665.00 frames. 
], tot_loss[loss=0.217, simple_loss=0.3033, pruned_loss=0.06537, over 4271707.31 frames. ], batch size: 298, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:43:53,626 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.803e+02 5.577e+02 8.393e+02 1.338e+03 3.088e+03, threshold=1.679e+03, percent-clipped=11.0 2023-06-27 03:44:12,212 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.17 vs. limit=10.0 2023-06-27 03:44:32,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1718592.0, ans=0.2 2023-06-27 03:45:09,413 INFO [train.py:996] (2/4) Epoch 10, batch 12000, loss[loss=0.1935, simple_loss=0.2592, pruned_loss=0.06395, over 21796.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2961, pruned_loss=0.06344, over 4251941.26 frames. ], batch size: 124, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 03:45:09,414 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-27 03:45:28,123 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([5.3582, 5.4994, 5.2350, 5.0408], device='cuda:2') 2023-06-27 03:45:30,598 INFO [train.py:1028] (2/4) Epoch 10, validation: loss=0.2595, simple_loss=0.3509, pruned_loss=0.08412, over 1796401.00 frames. 2023-06-27 03:45:30,599 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-27 03:46:12,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1718832.0, ans=0.125 2023-06-27 03:46:36,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1718892.0, ans=0.125 2023-06-27 03:47:18,630 INFO [train.py:996] (2/4) Epoch 10, batch 12050, loss[loss=0.1968, simple_loss=0.2679, pruned_loss=0.06281, over 21813.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2929, pruned_loss=0.0658, over 4254282.63 frames. ], batch size: 247, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 03:47:43,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1719072.0, ans=0.1 2023-06-27 03:47:50,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1719072.0, ans=0.125 2023-06-27 03:47:53,492 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.219e+02 6.182e+02 8.249e+02 1.335e+03 3.065e+03, threshold=1.650e+03, percent-clipped=10.0 2023-06-27 03:48:23,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1719192.0, ans=0.07 2023-06-27 03:49:08,225 INFO [train.py:996] (2/4) Epoch 10, batch 12100, loss[loss=0.1842, simple_loss=0.271, pruned_loss=0.04871, over 19898.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2973, pruned_loss=0.06923, over 4266554.04 frames. 
], batch size: 702, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:50:25,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1719492.0, ans=0.125 2023-06-27 03:50:43,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1719552.0, ans=0.0 2023-06-27 03:50:44,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1719552.0, ans=10.0 2023-06-27 03:51:06,043 INFO [train.py:996] (2/4) Epoch 10, batch 12150, loss[loss=0.1958, simple_loss=0.2755, pruned_loss=0.058, over 21219.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.3001, pruned_loss=0.06885, over 4274275.22 frames. ], batch size: 176, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:51:08,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1719612.0, ans=0.125 2023-06-27 03:51:14,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1719612.0, ans=0.0 2023-06-27 03:51:40,996 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.275e+02 6.507e+02 9.290e+02 1.300e+03 3.036e+03, threshold=1.858e+03, percent-clipped=15.0 2023-06-27 03:52:00,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1719732.0, ans=0.0 2023-06-27 03:52:06,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1719732.0, ans=0.2 2023-06-27 03:52:14,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1719792.0, ans=0.2 2023-06-27 03:52:53,543 INFO [train.py:996] (2/4) Epoch 10, batch 12200, loss[loss=0.2567, simple_loss=0.3751, pruned_loss=0.06921, over 19687.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2983, pruned_loss=0.06825, over 4274833.00 frames. ], batch size: 702, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:53:42,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1720032.0, ans=0.1 2023-06-27 03:54:37,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1720152.0, ans=0.125 2023-06-27 03:54:40,590 INFO [train.py:996] (2/4) Epoch 10, batch 12250, loss[loss=0.2529, simple_loss=0.3738, pruned_loss=0.06598, over 19761.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.291, pruned_loss=0.06503, over 4279163.04 frames. ], batch size: 702, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:55:14,851 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.738e+02 5.320e+02 7.726e+02 1.159e+03 2.410e+03, threshold=1.545e+03, percent-clipped=8.0 2023-06-27 03:55:15,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1720272.0, ans=0.125 2023-06-27 03:55:19,380 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.20 vs. limit=22.5 2023-06-27 03:56:10,695 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. 
limit=6.0 2023-06-27 03:56:20,762 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-27 03:56:25,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1720452.0, ans=0.125 2023-06-27 03:56:28,169 INFO [train.py:996] (2/4) Epoch 10, batch 12300, loss[loss=0.1583, simple_loss=0.2433, pruned_loss=0.03662, over 21441.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2845, pruned_loss=0.06075, over 4283901.05 frames. ], batch size: 194, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:56:54,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1720572.0, ans=0.125 2023-06-27 03:58:10,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1720752.0, ans=0.95 2023-06-27 03:58:16,035 INFO [train.py:996] (2/4) Epoch 10, batch 12350, loss[loss=0.2276, simple_loss=0.2914, pruned_loss=0.08189, over 19988.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2886, pruned_loss=0.06247, over 4278265.92 frames. ], batch size: 703, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:58:50,876 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.592e+02 6.371e+02 1.042e+03 1.645e+03 3.511e+03, threshold=2.083e+03, percent-clipped=28.0 2023-06-27 03:59:08,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1720932.0, ans=0.125 2023-06-27 03:59:13,764 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.27 vs. limit=12.0 2023-06-27 03:59:37,269 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.05 vs. limit=10.0 2023-06-27 04:00:04,494 INFO [train.py:996] (2/4) Epoch 10, batch 12400, loss[loss=0.2132, simple_loss=0.2833, pruned_loss=0.07158, over 21842.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2905, pruned_loss=0.064, over 4278222.98 frames. ], batch size: 298, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 04:00:35,531 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=12.0 2023-06-27 04:00:49,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1721232.0, ans=0.025 2023-06-27 04:00:51,394 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=15.0 2023-06-27 04:00:54,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1721232.0, ans=0.0 2023-06-27 04:01:05,485 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.89 vs. 
limit=22.5 2023-06-27 04:01:13,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1721292.0, ans=0.0 2023-06-27 04:01:46,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1721352.0, ans=10.0 2023-06-27 04:01:58,690 INFO [train.py:996] (2/4) Epoch 10, batch 12450, loss[loss=0.28, simple_loss=0.3591, pruned_loss=0.1005, over 21750.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2931, pruned_loss=0.06656, over 4275142.59 frames. ], batch size: 118, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:02:36,087 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.371e+02 6.019e+02 7.668e+02 9.401e+02 2.639e+03, threshold=1.534e+03, percent-clipped=2.0 2023-06-27 04:03:30,374 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-27 04:03:48,674 INFO [train.py:996] (2/4) Epoch 10, batch 12500, loss[loss=0.2351, simple_loss=0.3415, pruned_loss=0.06435, over 21822.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.3036, pruned_loss=0.07047, over 4278978.29 frames. ], batch size: 282, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:03:52,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1721712.0, ans=0.125 2023-06-27 04:04:19,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1721772.0, ans=0.0 2023-06-27 04:04:22,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1721772.0, ans=0.2 2023-06-27 04:04:57,211 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.90 vs. limit=15.0 2023-06-27 04:05:20,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1721952.0, ans=0.125 2023-06-27 04:05:45,542 INFO [train.py:996] (2/4) Epoch 10, batch 12550, loss[loss=0.2095, simple_loss=0.2916, pruned_loss=0.06373, over 21785.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.309, pruned_loss=0.07172, over 4283377.66 frames. ], batch size: 247, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:06:08,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1722072.0, ans=0.015 2023-06-27 04:06:27,273 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.271e+02 6.681e+02 8.893e+02 1.594e+03 3.232e+03, threshold=1.779e+03, percent-clipped=27.0 2023-06-27 04:06:32,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1722132.0, ans=0.0 2023-06-27 04:06:59,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1722192.0, ans=0.0 2023-06-27 04:07:15,081 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.78 vs. 
limit=15.0 2023-06-27 04:07:19,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1722252.0, ans=0.125 2023-06-27 04:07:39,582 INFO [train.py:996] (2/4) Epoch 10, batch 12600, loss[loss=0.2172, simple_loss=0.295, pruned_loss=0.06969, over 20805.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3064, pruned_loss=0.07047, over 4273445.51 frames. ], batch size: 608, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:08:43,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1722492.0, ans=0.1 2023-06-27 04:09:06,805 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.16 vs. limit=15.0 2023-06-27 04:09:20,820 INFO [train.py:996] (2/4) Epoch 10, batch 12650, loss[loss=0.2039, simple_loss=0.272, pruned_loss=0.06789, over 21821.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.3004, pruned_loss=0.06749, over 4277248.68 frames. ], batch size: 298, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:09:31,014 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.43 vs. limit=15.0 2023-06-27 04:09:31,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1722612.0, ans=0.2 2023-06-27 04:09:50,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1722672.0, ans=0.125 2023-06-27 04:09:52,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1722672.0, ans=0.125 2023-06-27 04:10:02,124 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.614e+02 6.359e+02 1.024e+03 1.411e+03 2.503e+03, threshold=2.048e+03, percent-clipped=9.0 2023-06-27 04:10:11,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1722732.0, ans=0.2 2023-06-27 04:10:42,680 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 04:11:13,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1722912.0, ans=0.125 2023-06-27 04:11:14,812 INFO [train.py:996] (2/4) Epoch 10, batch 12700, loss[loss=0.2161, simple_loss=0.293, pruned_loss=0.06954, over 21506.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2995, pruned_loss=0.06882, over 4279058.21 frames. ], batch size: 211, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:11:52,645 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0 2023-06-27 04:13:08,222 INFO [train.py:996] (2/4) Epoch 10, batch 12750, loss[loss=0.1913, simple_loss=0.2817, pruned_loss=0.05042, over 21802.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2996, pruned_loss=0.06874, over 4280194.74 frames. 
], batch size: 282, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:13:17,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1723212.0, ans=0.125 2023-06-27 04:13:20,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1723212.0, ans=0.125 2023-06-27 04:13:38,771 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.282e+02 6.128e+02 7.827e+02 1.074e+03 2.616e+03, threshold=1.565e+03, percent-clipped=3.0 2023-06-27 04:13:39,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1723272.0, ans=0.09899494936611666 2023-06-27 04:13:40,362 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-06-27 04:14:33,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1723452.0, ans=0.1 2023-06-27 04:14:55,470 INFO [train.py:996] (2/4) Epoch 10, batch 12800, loss[loss=0.2377, simple_loss=0.3361, pruned_loss=0.06967, over 19889.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2993, pruned_loss=0.06942, over 4279827.09 frames. ], batch size: 703, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 04:14:56,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1723512.0, ans=0.0 2023-06-27 04:15:12,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1723572.0, ans=0.1 2023-06-27 04:15:39,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1723632.0, ans=0.0 2023-06-27 04:16:45,024 INFO [train.py:996] (2/4) Epoch 10, batch 12850, loss[loss=0.2234, simple_loss=0.2937, pruned_loss=0.07656, over 20141.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.3009, pruned_loss=0.07079, over 4280420.78 frames. ], batch size: 704, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 04:17:22,026 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.972e+02 5.917e+02 7.824e+02 1.083e+03 2.191e+03, threshold=1.565e+03, percent-clipped=6.0 2023-06-27 04:17:29,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1723932.0, ans=0.0 2023-06-27 04:18:08,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1723992.0, ans=0.1 2023-06-27 04:18:29,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1724052.0, ans=0.125 2023-06-27 04:18:34,548 INFO [train.py:996] (2/4) Epoch 10, batch 12900, loss[loss=0.1857, simple_loss=0.2614, pruned_loss=0.05494, over 21266.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2989, pruned_loss=0.06793, over 4276765.21 frames. 
], batch size: 176, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 04:19:47,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1724292.0, ans=0.1 2023-06-27 04:20:08,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1724352.0, ans=0.0 2023-06-27 04:20:08,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1724352.0, ans=0.125 2023-06-27 04:20:09,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1724352.0, ans=0.125 2023-06-27 04:20:11,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1724352.0, ans=0.125 2023-06-27 04:20:22,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1724412.0, ans=0.125 2023-06-27 04:20:23,523 INFO [train.py:996] (2/4) Epoch 10, batch 12950, loss[loss=0.2044, simple_loss=0.2899, pruned_loss=0.0595, over 21718.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2978, pruned_loss=0.06669, over 4274413.63 frames. ], batch size: 332, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 04:20:38,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1724412.0, ans=0.125 2023-06-27 04:20:40,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1724412.0, ans=0.04949747468305833 2023-06-27 04:21:19,209 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.055e+02 6.814e+02 9.301e+02 1.537e+03 3.645e+03, threshold=1.860e+03, percent-clipped=23.0 2023-06-27 04:21:40,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1724592.0, ans=0.125 2023-06-27 04:22:17,959 INFO [train.py:996] (2/4) Epoch 10, batch 13000, loss[loss=0.1968, simple_loss=0.2863, pruned_loss=0.05369, over 21619.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2966, pruned_loss=0.06682, over 4270542.53 frames. ], batch size: 389, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:22:25,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1724712.0, ans=0.025 2023-06-27 04:22:58,169 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 04:23:20,371 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.01 vs. limit=22.5 2023-06-27 04:23:26,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1724892.0, ans=0.0 2023-06-27 04:24:05,863 INFO [train.py:996] (2/4) Epoch 10, batch 13050, loss[loss=0.1906, simple_loss=0.2672, pruned_loss=0.05704, over 21807.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2927, pruned_loss=0.0648, over 4269829.96 frames. 
], batch size: 247, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:24:25,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1725072.0, ans=0.125 2023-06-27 04:24:49,086 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.711e+02 5.473e+02 7.954e+02 1.041e+03 2.275e+03, threshold=1.591e+03, percent-clipped=1.0 2023-06-27 04:25:24,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.whiten.whitening_limit, batch_count=1725192.0, ans=12.0 2023-06-27 04:25:27,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1725192.0, ans=0.125 2023-06-27 04:25:32,019 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0 2023-06-27 04:25:53,813 INFO [train.py:996] (2/4) Epoch 10, batch 13100, loss[loss=0.2226, simple_loss=0.3025, pruned_loss=0.07136, over 21328.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2936, pruned_loss=0.06481, over 4281455.47 frames. ], batch size: 176, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:27:06,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1725492.0, ans=0.125 2023-06-27 04:27:43,055 INFO [train.py:996] (2/4) Epoch 10, batch 13150, loss[loss=0.2161, simple_loss=0.2911, pruned_loss=0.07061, over 21848.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2961, pruned_loss=0.0669, over 4282333.26 frames. ], batch size: 124, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:28:14,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1725612.0, ans=0.2 2023-06-27 04:28:17,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1725672.0, ans=0.05 2023-06-27 04:28:18,106 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=15.0 2023-06-27 04:28:30,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1725672.0, ans=0.125 2023-06-27 04:28:32,056 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.070e+02 6.134e+02 8.116e+02 1.164e+03 2.711e+03, threshold=1.623e+03, percent-clipped=9.0 2023-06-27 04:29:02,313 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 04:29:30,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1725852.0, ans=0.125 2023-06-27 04:29:37,434 INFO [train.py:996] (2/4) Epoch 10, batch 13200, loss[loss=0.239, simple_loss=0.3125, pruned_loss=0.08278, over 21578.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.295, pruned_loss=0.06699, over 4278879.75 frames. 
], batch size: 389, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 04:30:00,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1725972.0, ans=0.0 2023-06-27 04:31:20,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1726152.0, ans=0.1 2023-06-27 04:31:26,756 INFO [train.py:996] (2/4) Epoch 10, batch 13250, loss[loss=0.2391, simple_loss=0.3109, pruned_loss=0.08369, over 21697.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2951, pruned_loss=0.06862, over 4286915.74 frames. ], batch size: 441, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:31:27,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1726212.0, ans=0.125 2023-06-27 04:31:44,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1726212.0, ans=0.125 2023-06-27 04:31:46,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1726212.0, ans=0.1 2023-06-27 04:31:58,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1726272.0, ans=0.125 2023-06-27 04:32:06,247 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.999e+02 7.655e+02 1.062e+03 1.668e+03 3.650e+03, threshold=2.123e+03, percent-clipped=27.0 2023-06-27 04:32:22,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1726332.0, ans=0.125 2023-06-27 04:32:24,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1726332.0, ans=0.125 2023-06-27 04:33:01,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1726452.0, ans=0.0 2023-06-27 04:33:21,196 INFO [train.py:996] (2/4) Epoch 10, batch 13300, loss[loss=0.2295, simple_loss=0.3184, pruned_loss=0.07033, over 21798.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2988, pruned_loss=0.06968, over 4286498.69 frames. ], batch size: 282, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:33:30,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1726512.0, ans=0.125 2023-06-27 04:34:58,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1726752.0, ans=0.125 2023-06-27 04:35:10,301 INFO [train.py:996] (2/4) Epoch 10, batch 13350, loss[loss=0.2399, simple_loss=0.3233, pruned_loss=0.07822, over 21790.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3027, pruned_loss=0.07117, over 4278693.47 frames. ], batch size: 124, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:35:11,695 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.81 vs. 
limit=15.0 2023-06-27 04:35:43,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1726872.0, ans=0.125 2023-06-27 04:35:48,973 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.150e+02 5.865e+02 7.490e+02 1.135e+03 2.182e+03, threshold=1.498e+03, percent-clipped=1.0 2023-06-27 04:36:58,403 INFO [train.py:996] (2/4) Epoch 10, batch 13400, loss[loss=0.2104, simple_loss=0.2845, pruned_loss=0.06811, over 21913.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3037, pruned_loss=0.07159, over 4268233.71 frames. ], batch size: 316, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:36:59,700 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. limit=6.0 2023-06-27 04:37:30,730 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.90 vs. limit=12.0 2023-06-27 04:37:46,819 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=22.5 2023-06-27 04:38:16,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1727292.0, ans=0.0 2023-06-27 04:38:47,932 INFO [train.py:996] (2/4) Epoch 10, batch 13450, loss[loss=0.2818, simple_loss=0.342, pruned_loss=0.1108, over 21456.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3052, pruned_loss=0.07291, over 4266258.89 frames. ], batch size: 471, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:39:06,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1727412.0, ans=0.1 2023-06-27 04:39:39,492 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.261e+02 5.946e+02 7.827e+02 1.298e+03 2.826e+03, threshold=1.565e+03, percent-clipped=16.0 2023-06-27 04:39:43,673 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 04:39:45,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1727532.0, ans=0.0 2023-06-27 04:39:49,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1727532.0, ans=0.125 2023-06-27 04:40:35,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1727652.0, ans=0.125 2023-06-27 04:40:43,704 INFO [train.py:996] (2/4) Epoch 10, batch 13500, loss[loss=0.1531, simple_loss=0.2051, pruned_loss=0.05058, over 21830.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2962, pruned_loss=0.07011, over 4257143.62 frames. ], batch size: 102, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:40:49,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1727712.0, ans=0.1 2023-06-27 04:41:00,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1727712.0, ans=0.2 2023-06-27 04:41:23,213 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.63 vs. 
limit=15.0 2023-06-27 04:41:23,255 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.19 vs. limit=10.0 2023-06-27 04:41:54,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1727892.0, ans=0.015 2023-06-27 04:42:15,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1727952.0, ans=0.1 2023-06-27 04:42:24,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1727952.0, ans=0.125 2023-06-27 04:42:35,515 INFO [train.py:996] (2/4) Epoch 10, batch 13550, loss[loss=0.3389, simple_loss=0.4122, pruned_loss=0.1328, over 21503.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2995, pruned_loss=0.07031, over 4264011.07 frames. ], batch size: 507, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:42:36,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1728012.0, ans=0.2 2023-06-27 04:43:25,548 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.534e+02 7.345e+02 1.395e+03 2.191e+03 3.934e+03, threshold=2.790e+03, percent-clipped=45.0 2023-06-27 04:43:35,244 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.16 vs. limit=15.0 2023-06-27 04:43:58,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1728252.0, ans=0.125 2023-06-27 04:44:00,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1728252.0, ans=0.0 2023-06-27 04:44:14,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1728252.0, ans=0.125 2023-06-27 04:44:21,770 INFO [train.py:996] (2/4) Epoch 10, batch 13600, loss[loss=0.2098, simple_loss=0.295, pruned_loss=0.06231, over 21877.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2995, pruned_loss=0.07059, over 4271407.54 frames. ], batch size: 371, lr: 2.94e-03, grad_scale: 32.0 2023-06-27 04:44:31,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1728312.0, ans=0.1 2023-06-27 04:45:59,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1728552.0, ans=0.07 2023-06-27 04:46:00,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1728552.0, ans=0.125 2023-06-27 04:46:13,944 INFO [train.py:996] (2/4) Epoch 10, batch 13650, loss[loss=0.1972, simple_loss=0.2596, pruned_loss=0.06741, over 21483.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.295, pruned_loss=0.06833, over 4276046.43 frames. 
], batch size: 389, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:46:23,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1728612.0, ans=0.125 2023-06-27 04:46:48,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1728672.0, ans=0.1 2023-06-27 04:46:59,894 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.764e+02 5.044e+02 6.157e+02 8.736e+02 2.830e+03, threshold=1.231e+03, percent-clipped=2.0 2023-06-27 04:47:43,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1728852.0, ans=0.125 2023-06-27 04:47:50,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1728852.0, ans=0.125 2023-06-27 04:48:02,144 INFO [train.py:996] (2/4) Epoch 10, batch 13700, loss[loss=0.2518, simple_loss=0.3299, pruned_loss=0.08683, over 21630.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2911, pruned_loss=0.06824, over 4263566.69 frames. ], batch size: 441, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:48:15,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1728912.0, ans=0.1 2023-06-27 04:48:52,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1729032.0, ans=0.95 2023-06-27 04:49:05,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1729032.0, ans=0.0 2023-06-27 04:49:16,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1729092.0, ans=0.0 2023-06-27 04:49:50,689 INFO [train.py:996] (2/4) Epoch 10, batch 13750, loss[loss=0.1905, simple_loss=0.2696, pruned_loss=0.05566, over 21584.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2871, pruned_loss=0.06727, over 4254482.73 frames. ], batch size: 230, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:50:44,255 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.944e+02 7.619e+02 1.226e+03 1.767e+03 3.252e+03, threshold=2.451e+03, percent-clipped=47.0 2023-06-27 04:50:52,681 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=12.0 2023-06-27 04:51:05,623 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.06 vs. limit=15.0 2023-06-27 04:51:33,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1729452.0, ans=0.0 2023-06-27 04:51:44,896 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-27 04:51:52,082 INFO [train.py:996] (2/4) Epoch 10, batch 13800, loss[loss=0.2225, simple_loss=0.3222, pruned_loss=0.06135, over 21579.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.292, pruned_loss=0.06577, over 4255674.50 frames. ], batch size: 230, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:52:05,184 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.15 vs. 
limit=12.0 2023-06-27 04:52:06,863 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-06-27 04:52:23,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1729572.0, ans=0.125 2023-06-27 04:52:34,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1729632.0, ans=10.0 2023-06-27 04:52:53,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1729692.0, ans=0.125 2023-06-27 04:53:13,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1729692.0, ans=0.0 2023-06-27 04:53:22,067 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-27 04:53:39,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1729812.0, ans=0.125 2023-06-27 04:53:40,118 INFO [train.py:996] (2/4) Epoch 10, batch 13850, loss[loss=0.2257, simple_loss=0.3077, pruned_loss=0.07186, over 21628.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.3006, pruned_loss=0.0681, over 4256742.90 frames. ], batch size: 263, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 04:54:06,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1729872.0, ans=0.07 2023-06-27 04:54:16,159 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=22.5 2023-06-27 04:54:23,607 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.256e+02 7.886e+02 1.223e+03 1.813e+03 4.044e+03, threshold=2.445e+03, percent-clipped=7.0 2023-06-27 04:55:11,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1730052.0, ans=0.125 2023-06-27 04:55:28,128 INFO [train.py:996] (2/4) Epoch 10, batch 13900, loss[loss=0.2179, simple_loss=0.2945, pruned_loss=0.07061, over 21633.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3034, pruned_loss=0.07084, over 4263457.56 frames. ], batch size: 230, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 04:55:57,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1730172.0, ans=0.0 2023-06-27 04:57:14,356 INFO [train.py:996] (2/4) Epoch 10, batch 13950, loss[loss=0.2524, simple_loss=0.3761, pruned_loss=0.06432, over 19902.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3033, pruned_loss=0.07229, over 4271740.86 frames. 
], batch size: 702, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 04:57:18,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1730412.0, ans=0.1 2023-06-27 04:57:44,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1730472.0, ans=0.0 2023-06-27 04:57:44,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1730472.0, ans=0.025 2023-06-27 04:58:02,051 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.602e+02 6.616e+02 8.570e+02 1.217e+03 2.156e+03, threshold=1.714e+03, percent-clipped=0.0 2023-06-27 04:58:13,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1730532.0, ans=0.125 2023-06-27 04:58:13,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1730532.0, ans=0.1 2023-06-27 04:58:55,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1730652.0, ans=0.04949747468305833 2023-06-27 04:58:58,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1730712.0, ans=0.0 2023-06-27 04:58:59,357 INFO [train.py:996] (2/4) Epoch 10, batch 14000, loss[loss=0.2286, simple_loss=0.3211, pruned_loss=0.06805, over 21578.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.3023, pruned_loss=0.07074, over 4272626.88 frames. ], batch size: 471, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:00:51,602 INFO [train.py:996] (2/4) Epoch 10, batch 14050, loss[loss=0.1736, simple_loss=0.2438, pruned_loss=0.05172, over 21611.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2975, pruned_loss=0.06755, over 4273580.96 frames. ], batch size: 282, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:01:08,222 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=15.0 2023-06-27 05:01:33,554 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.649e+02 7.272e+02 1.104e+03 1.609e+03 3.327e+03, threshold=2.207e+03, percent-clipped=18.0 2023-06-27 05:01:42,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1731132.0, ans=0.07 2023-06-27 05:02:26,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1731312.0, ans=0.1 2023-06-27 05:02:27,182 INFO [train.py:996] (2/4) Epoch 10, batch 14100, loss[loss=0.2167, simple_loss=0.2951, pruned_loss=0.06918, over 21697.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.293, pruned_loss=0.06722, over 4268880.80 frames. ], batch size: 332, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:02:28,285 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.87 vs. 
limit=15.0 2023-06-27 05:02:50,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1731372.0, ans=0.125 2023-06-27 05:03:05,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1731372.0, ans=0.0 2023-06-27 05:03:10,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1731432.0, ans=0.1 2023-06-27 05:03:22,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1731432.0, ans=0.04949747468305833 2023-06-27 05:04:01,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1731552.0, ans=0.2 2023-06-27 05:04:08,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1731552.0, ans=0.0 2023-06-27 05:04:12,971 INFO [train.py:996] (2/4) Epoch 10, batch 14150, loss[loss=0.2227, simple_loss=0.3072, pruned_loss=0.06911, over 21886.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2954, pruned_loss=0.06748, over 4267121.98 frames. ], batch size: 98, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:04:44,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1731672.0, ans=0.125 2023-06-27 05:04:59,067 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.088e+02 7.057e+02 1.107e+03 1.740e+03 3.584e+03, threshold=2.215e+03, percent-clipped=8.0 2023-06-27 05:05:09,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1731732.0, ans=0.125 2023-06-27 05:05:48,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1731852.0, ans=0.0 2023-06-27 05:05:55,687 INFO [train.py:996] (2/4) Epoch 10, batch 14200, loss[loss=0.1799, simple_loss=0.2785, pruned_loss=0.04068, over 21662.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2938, pruned_loss=0.06687, over 4263111.46 frames. ], batch size: 263, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:06:42,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1732032.0, ans=15.0 2023-06-27 05:06:43,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1732032.0, ans=0.125 2023-06-27 05:07:36,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1732152.0, ans=0.125 2023-06-27 05:07:41,170 INFO [train.py:996] (2/4) Epoch 10, batch 14250, loss[loss=0.1739, simple_loss=0.2623, pruned_loss=0.04271, over 21738.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2884, pruned_loss=0.06629, over 4260596.75 frames. 
], batch size: 351, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:07:43,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1732212.0, ans=0.2 2023-06-27 05:07:56,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1732212.0, ans=0.125 2023-06-27 05:07:59,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1732272.0, ans=0.125 2023-06-27 05:08:26,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1732332.0, ans=22.5 2023-06-27 05:08:32,536 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.770e+02 5.743e+02 8.448e+02 1.114e+03 2.445e+03, threshold=1.690e+03, percent-clipped=1.0 2023-06-27 05:09:25,900 INFO [train.py:996] (2/4) Epoch 10, batch 14300, loss[loss=0.2015, simple_loss=0.2796, pruned_loss=0.06173, over 17938.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2906, pruned_loss=0.06611, over 4255700.20 frames. ], batch size: 70, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:09:49,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1732572.0, ans=0.1 2023-06-27 05:10:05,736 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.65 vs. limit=22.5 2023-06-27 05:10:15,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1732632.0, ans=0.125 2023-06-27 05:10:22,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1732632.0, ans=0.1 2023-06-27 05:11:14,222 INFO [train.py:996] (2/4) Epoch 10, batch 14350, loss[loss=0.1989, simple_loss=0.2801, pruned_loss=0.05887, over 21946.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2914, pruned_loss=0.06531, over 4245910.58 frames. ], batch size: 316, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:11:58,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1732932.0, ans=0.125 2023-06-27 05:12:04,617 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.018e+02 7.754e+02 1.154e+03 1.779e+03 3.670e+03, threshold=2.308e+03, percent-clipped=30.0 2023-06-27 05:12:47,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1733052.0, ans=0.0 2023-06-27 05:12:57,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1733052.0, ans=0.125 2023-06-27 05:13:00,595 INFO [train.py:996] (2/4) Epoch 10, batch 14400, loss[loss=0.2001, simple_loss=0.269, pruned_loss=0.06558, over 21243.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2901, pruned_loss=0.0666, over 4251880.81 frames. ], batch size: 176, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:13:27,349 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.58 vs. 
limit=15.0 2023-06-27 05:14:02,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1733292.0, ans=0.0 2023-06-27 05:14:29,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1733352.0, ans=0.0 2023-06-27 05:14:46,470 INFO [train.py:996] (2/4) Epoch 10, batch 14450, loss[loss=0.1818, simple_loss=0.2483, pruned_loss=0.05761, over 21646.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2855, pruned_loss=0.06655, over 4251505.45 frames. ], batch size: 298, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:15:36,480 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.984e+02 5.618e+02 7.352e+02 1.088e+03 2.382e+03, threshold=1.470e+03, percent-clipped=1.0 2023-06-27 05:15:39,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1733532.0, ans=0.125 2023-06-27 05:16:11,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1733652.0, ans=0.125 2023-06-27 05:16:28,066 INFO [train.py:996] (2/4) Epoch 10, batch 14500, loss[loss=0.2309, simple_loss=0.2964, pruned_loss=0.08269, over 21297.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2825, pruned_loss=0.06606, over 4256578.91 frames. ], batch size: 471, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:16:54,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1733772.0, ans=0.0 2023-06-27 05:16:57,285 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.53 vs. limit=15.0 2023-06-27 05:17:29,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1733892.0, ans=0.0 2023-06-27 05:17:47,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1733892.0, ans=0.1 2023-06-27 05:18:04,512 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.43 vs. limit=10.0 2023-06-27 05:18:12,022 INFO [train.py:996] (2/4) Epoch 10, batch 14550, loss[loss=0.2552, simple_loss=0.3274, pruned_loss=0.09147, over 21595.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2867, pruned_loss=0.06717, over 4266434.13 frames. ], batch size: 389, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:19:02,697 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.284e+02 5.674e+02 7.709e+02 1.144e+03 2.600e+03, threshold=1.542e+03, percent-clipped=15.0 2023-06-27 05:20:05,590 INFO [train.py:996] (2/4) Epoch 10, batch 14600, loss[loss=0.2327, simple_loss=0.3224, pruned_loss=0.07146, over 21650.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2966, pruned_loss=0.07102, over 4268449.53 frames. 
], batch size: 230, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:20:10,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1734312.0, ans=0.125 2023-06-27 05:20:22,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1734372.0, ans=0.2 2023-06-27 05:20:36,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1734372.0, ans=0.125 2023-06-27 05:20:50,264 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.98 vs. limit=5.0 2023-06-27 05:21:48,279 INFO [train.py:996] (2/4) Epoch 10, batch 14650, loss[loss=0.2231, simple_loss=0.3057, pruned_loss=0.07027, over 21669.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2979, pruned_loss=0.07042, over 4275744.20 frames. ], batch size: 230, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:22:15,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1734672.0, ans=0.125 2023-06-27 05:22:38,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1734732.0, ans=0.125 2023-06-27 05:22:39,587 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.998e+02 5.657e+02 7.781e+02 1.109e+03 2.213e+03, threshold=1.556e+03, percent-clipped=10.0 2023-06-27 05:22:53,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1734732.0, ans=0.0 2023-06-27 05:23:04,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1734792.0, ans=0.0 2023-06-27 05:23:27,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1734852.0, ans=0.125 2023-06-27 05:23:37,275 INFO [train.py:996] (2/4) Epoch 10, batch 14700, loss[loss=0.2009, simple_loss=0.3025, pruned_loss=0.04967, over 21781.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2925, pruned_loss=0.06554, over 4267033.25 frames. ], batch size: 351, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:25:05,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1735092.0, ans=0.125 2023-06-27 05:25:36,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1735152.0, ans=0.1 2023-06-27 05:25:38,811 INFO [train.py:996] (2/4) Epoch 10, batch 14750, loss[loss=0.3039, simple_loss=0.3721, pruned_loss=0.1179, over 21562.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2982, pruned_loss=0.06853, over 4269719.18 frames. ], batch size: 414, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:25:39,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1735212.0, ans=0.0 2023-06-27 05:25:49,548 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.16 vs. 
limit=22.5 2023-06-27 05:25:50,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1735212.0, ans=0.125 2023-06-27 05:25:50,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1735212.0, ans=0.125 2023-06-27 05:26:07,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1735272.0, ans=0.125 2023-06-27 05:26:30,560 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.686e+02 7.000e+02 1.273e+03 1.820e+03 3.687e+03, threshold=2.546e+03, percent-clipped=36.0 2023-06-27 05:27:29,175 INFO [train.py:996] (2/4) Epoch 10, batch 14800, loss[loss=0.2266, simple_loss=0.3061, pruned_loss=0.07355, over 21448.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3069, pruned_loss=0.07255, over 4264357.50 frames. ], batch size: 389, lr: 2.94e-03, grad_scale: 32.0 2023-06-27 05:27:57,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1735572.0, ans=0.2 2023-06-27 05:28:26,356 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 05:28:58,157 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=22.5 2023-06-27 05:29:01,981 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.94 vs. limit=6.0 2023-06-27 05:29:09,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1735752.0, ans=0.125 2023-06-27 05:29:10,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1735752.0, ans=0.0 2023-06-27 05:29:29,436 INFO [train.py:996] (2/4) Epoch 10, batch 14850, loss[loss=0.1902, simple_loss=0.2588, pruned_loss=0.06077, over 21676.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3016, pruned_loss=0.07175, over 4262308.37 frames. ], batch size: 299, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:29:47,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1735872.0, ans=0.2 2023-06-27 05:29:49,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1735872.0, ans=0.0 2023-06-27 05:29:49,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1735872.0, ans=0.2 2023-06-27 05:30:02,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1735872.0, ans=0.125 2023-06-27 05:30:07,792 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.83 vs. 
limit=15.0 2023-06-27 05:30:16,842 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.100e+02 5.316e+02 7.277e+02 1.299e+03 3.940e+03, threshold=1.455e+03, percent-clipped=5.0 2023-06-27 05:30:52,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1736052.0, ans=6.0 2023-06-27 05:31:19,431 INFO [train.py:996] (2/4) Epoch 10, batch 14900, loss[loss=0.2391, simple_loss=0.3146, pruned_loss=0.08179, over 21506.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3041, pruned_loss=0.07371, over 4268056.67 frames. ], batch size: 194, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:31:36,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1736172.0, ans=0.125 2023-06-27 05:32:12,021 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. limit=6.0 2023-06-27 05:32:27,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1736292.0, ans=0.125 2023-06-27 05:33:06,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1736352.0, ans=0.07 2023-06-27 05:33:11,108 INFO [train.py:996] (2/4) Epoch 10, batch 14950, loss[loss=0.2254, simple_loss=0.3054, pruned_loss=0.07274, over 21617.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3041, pruned_loss=0.07276, over 4264642.32 frames. ], batch size: 263, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:34:05,500 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.924e+02 5.785e+02 8.505e+02 1.255e+03 2.502e+03, threshold=1.701e+03, percent-clipped=18.0 2023-06-27 05:34:22,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1736592.0, ans=0.125 2023-06-27 05:34:36,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1736592.0, ans=0.0 2023-06-27 05:34:40,056 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.32 vs. limit=15.0 2023-06-27 05:34:40,183 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=22.5 2023-06-27 05:35:00,006 INFO [train.py:996] (2/4) Epoch 10, batch 15000, loss[loss=0.2861, simple_loss=0.3438, pruned_loss=0.1142, over 21613.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3063, pruned_loss=0.07428, over 4266365.24 frames. ], batch size: 508, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:35:00,007 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-27 05:35:13,681 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.8820, 4.3604, 4.5921, 4.0857], device='cuda:2') 2023-06-27 05:35:19,886 INFO [train.py:1028] (2/4) Epoch 10, validation: loss=0.2554, simple_loss=0.3462, pruned_loss=0.08227, over 1796401.00 frames. 
2023-06-27 05:35:19,887 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-27 05:35:27,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1736712.0, ans=0.2 2023-06-27 05:35:42,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1736772.0, ans=0.125 2023-06-27 05:35:53,081 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.18 vs. limit=15.0 2023-06-27 05:36:26,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1736892.0, ans=0.125 2023-06-27 05:36:49,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1736952.0, ans=0.125 2023-06-27 05:37:03,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1737012.0, ans=0.1 2023-06-27 05:37:04,900 INFO [train.py:996] (2/4) Epoch 10, batch 15050, loss[loss=0.2078, simple_loss=0.2957, pruned_loss=0.05992, over 21608.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3083, pruned_loss=0.0752, over 4259871.55 frames. ], batch size: 230, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:38:05,488 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.468e+02 6.013e+02 1.020e+03 1.764e+03 3.653e+03, threshold=2.041e+03, percent-clipped=28.0 2023-06-27 05:38:26,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1737192.0, ans=0.1 2023-06-27 05:38:30,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1737252.0, ans=0.0 2023-06-27 05:38:59,244 INFO [train.py:996] (2/4) Epoch 10, batch 15100, loss[loss=0.2224, simple_loss=0.299, pruned_loss=0.07286, over 21835.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3122, pruned_loss=0.07483, over 4257134.18 frames. ], batch size: 282, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:39:37,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1737372.0, ans=0.125 2023-06-27 05:39:47,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1737432.0, ans=0.125 2023-06-27 05:39:49,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1737432.0, ans=0.125 2023-06-27 05:40:20,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1737492.0, ans=0.125 2023-06-27 05:40:48,205 INFO [train.py:996] (2/4) Epoch 10, batch 15150, loss[loss=0.2378, simple_loss=0.2896, pruned_loss=0.093, over 21214.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3086, pruned_loss=0.07531, over 4254297.31 frames. 
], batch size: 471, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:40:56,043 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 05:40:57,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1737612.0, ans=0.125 2023-06-27 05:41:29,762 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-06-27 05:41:42,597 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.241e+02 5.996e+02 8.329e+02 1.455e+03 4.229e+03, threshold=1.666e+03, percent-clipped=17.0 2023-06-27 05:42:21,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1737852.0, ans=0.0 2023-06-27 05:42:36,437 INFO [train.py:996] (2/4) Epoch 10, batch 15200, loss[loss=0.1898, simple_loss=0.2581, pruned_loss=0.06072, over 21209.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2994, pruned_loss=0.07162, over 4258549.64 frames. ], batch size: 176, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:44:22,690 INFO [train.py:996] (2/4) Epoch 10, batch 15250, loss[loss=0.2074, simple_loss=0.2733, pruned_loss=0.07077, over 21744.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2934, pruned_loss=0.07064, over 4266404.52 frames. ], batch size: 112, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:44:51,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1738272.0, ans=0.1 2023-06-27 05:45:16,771 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.911e+02 6.076e+02 9.164e+02 1.527e+03 3.060e+03, threshold=1.833e+03, percent-clipped=16.0 2023-06-27 05:46:09,135 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=12.0 2023-06-27 05:46:11,061 INFO [train.py:996] (2/4) Epoch 10, batch 15300, loss[loss=0.2258, simple_loss=0.3049, pruned_loss=0.07338, over 21735.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2959, pruned_loss=0.07337, over 4274671.41 frames. ], batch size: 298, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:47:26,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1738692.0, ans=0.1 2023-06-27 05:47:39,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1738692.0, ans=0.125 2023-06-27 05:47:50,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1738752.0, ans=0.125 2023-06-27 05:47:58,652 INFO [train.py:996] (2/4) Epoch 10, batch 15350, loss[loss=0.2645, simple_loss=0.3461, pruned_loss=0.09145, over 21448.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3012, pruned_loss=0.07484, over 4271206.89 frames. 
], batch size: 471, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:48:25,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1738872.0, ans=0.125 2023-06-27 05:48:26,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1738872.0, ans=0.125 2023-06-27 05:48:41,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1738932.0, ans=0.125 2023-06-27 05:48:51,337 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.081e+02 6.656e+02 9.808e+02 1.431e+03 3.197e+03, threshold=1.962e+03, percent-clipped=8.0 2023-06-27 05:49:15,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1738992.0, ans=0.1 2023-06-27 05:49:17,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1738992.0, ans=0.125 2023-06-27 05:49:28,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1739052.0, ans=0.1 2023-06-27 05:49:29,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1739052.0, ans=0.025 2023-06-27 05:49:37,439 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.57 vs. limit=15.0 2023-06-27 05:49:45,875 INFO [train.py:996] (2/4) Epoch 10, batch 15400, loss[loss=0.1998, simple_loss=0.281, pruned_loss=0.05934, over 21796.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3037, pruned_loss=0.0734, over 4266904.63 frames. ], batch size: 247, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:50:40,790 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.80 vs. limit=10.0 2023-06-27 05:51:21,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1739352.0, ans=0.0 2023-06-27 05:51:29,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1739352.0, ans=0.015 2023-06-27 05:51:32,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1739412.0, ans=0.1 2023-06-27 05:51:33,738 INFO [train.py:996] (2/4) Epoch 10, batch 15450, loss[loss=0.1881, simple_loss=0.2708, pruned_loss=0.05266, over 15765.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3023, pruned_loss=0.07175, over 4256244.40 frames. ], batch size: 60, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:52:28,026 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.164e+02 6.328e+02 9.249e+02 1.410e+03 2.980e+03, threshold=1.850e+03, percent-clipped=8.0 2023-06-27 05:53:29,067 INFO [train.py:996] (2/4) Epoch 10, batch 15500, loss[loss=0.2448, simple_loss=0.3234, pruned_loss=0.08305, over 21591.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3055, pruned_loss=0.07226, over 4258533.79 frames. 
], batch size: 389, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:53:31,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1739712.0, ans=0.125 2023-06-27 05:53:54,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1739772.0, ans=0.125 2023-06-27 05:55:23,942 INFO [train.py:996] (2/4) Epoch 10, batch 15550, loss[loss=0.1878, simple_loss=0.2608, pruned_loss=0.05745, over 21763.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.3008, pruned_loss=0.06908, over 4262625.43 frames. ], batch size: 124, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:55:59,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1740132.0, ans=0.125 2023-06-27 05:56:17,357 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.325e+02 6.960e+02 1.269e+03 1.845e+03 3.300e+03, threshold=2.538e+03, percent-clipped=23.0 2023-06-27 05:56:18,655 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.32 vs. limit=15.0 2023-06-27 05:56:58,188 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.47 vs. limit=15.0 2023-06-27 05:57:11,155 INFO [train.py:996] (2/4) Epoch 10, batch 15600, loss[loss=0.1837, simple_loss=0.2423, pruned_loss=0.06252, over 20761.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2937, pruned_loss=0.0674, over 4263716.72 frames. ], batch size: 609, lr: 2.93e-03, grad_scale: 32.0 2023-06-27 05:57:23,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1740312.0, ans=0.2 2023-06-27 05:57:41,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1740372.0, ans=0.0 2023-06-27 05:58:55,125 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=22.5 2023-06-27 05:58:59,230 INFO [train.py:996] (2/4) Epoch 10, batch 15650, loss[loss=0.1897, simple_loss=0.2616, pruned_loss=0.05892, over 21657.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2915, pruned_loss=0.06671, over 4259057.68 frames. ], batch size: 316, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:59:40,014 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.43 vs. limit=15.0 2023-06-27 05:59:49,307 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.999e+02 5.253e+02 8.016e+02 1.068e+03 2.204e+03, threshold=1.603e+03, percent-clipped=0.0 2023-06-27 05:59:49,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1740732.0, ans=0.125 2023-06-27 05:59:58,947 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=15.0 2023-06-27 06:00:41,588 INFO [train.py:996] (2/4) Epoch 10, batch 15700, loss[loss=0.227, simple_loss=0.2895, pruned_loss=0.0823, over 21293.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2877, pruned_loss=0.06598, over 4267517.58 frames. 
], batch size: 471, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:01:08,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1740972.0, ans=0.0 2023-06-27 06:01:29,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1741032.0, ans=0.0 2023-06-27 06:01:36,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1741032.0, ans=0.05 2023-06-27 06:02:10,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1741152.0, ans=0.07 2023-06-27 06:02:28,411 INFO [train.py:996] (2/4) Epoch 10, batch 15750, loss[loss=0.1894, simple_loss=0.2616, pruned_loss=0.05856, over 21382.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2826, pruned_loss=0.0657, over 4268721.91 frames. ], batch size: 131, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:03:22,679 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.061e+02 5.702e+02 8.249e+02 1.125e+03 2.008e+03, threshold=1.650e+03, percent-clipped=7.0 2023-06-27 06:04:11,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1741452.0, ans=0.125 2023-06-27 06:04:14,231 INFO [train.py:996] (2/4) Epoch 10, batch 15800, loss[loss=0.2318, simple_loss=0.3003, pruned_loss=0.0817, over 21501.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2782, pruned_loss=0.06594, over 4270511.49 frames. ], batch size: 389, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:04:25,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1741512.0, ans=0.2 2023-06-27 06:04:25,645 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-06-27 06:04:28,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1741512.0, ans=0.0 2023-06-27 06:04:30,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1741572.0, ans=0.125 2023-06-27 06:04:35,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1741572.0, ans=0.125 2023-06-27 06:04:40,361 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 06:04:42,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1741572.0, ans=0.0 2023-06-27 06:04:42,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1741572.0, ans=0.125 2023-06-27 06:04:49,833 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-27 06:04:54,793 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. 
limit=6.0 2023-06-27 06:05:52,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1741752.0, ans=0.125 2023-06-27 06:06:00,836 INFO [train.py:996] (2/4) Epoch 10, batch 15850, loss[loss=0.2437, simple_loss=0.3139, pruned_loss=0.0868, over 21681.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2818, pruned_loss=0.06825, over 4265123.23 frames. ], batch size: 351, lr: 2.93e-03, grad_scale: 8.0 2023-06-27 06:06:57,729 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.205e+02 6.570e+02 8.492e+02 1.187e+03 2.613e+03, threshold=1.698e+03, percent-clipped=9.0 2023-06-27 06:07:17,467 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.80 vs. limit=15.0 2023-06-27 06:07:47,442 INFO [train.py:996] (2/4) Epoch 10, batch 15900, loss[loss=0.1872, simple_loss=0.2554, pruned_loss=0.05955, over 21802.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2817, pruned_loss=0.06849, over 4249361.34 frames. ], batch size: 352, lr: 2.93e-03, grad_scale: 8.0 2023-06-27 06:08:30,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1742232.0, ans=0.125 2023-06-27 06:08:37,582 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.99 vs. limit=22.5 2023-06-27 06:08:52,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1742292.0, ans=0.125 2023-06-27 06:09:33,392 INFO [train.py:996] (2/4) Epoch 10, batch 15950, loss[loss=0.2164, simple_loss=0.3401, pruned_loss=0.04629, over 19761.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2839, pruned_loss=0.06632, over 4238993.03 frames. ], batch size: 702, lr: 2.93e-03, grad_scale: 8.0 2023-06-27 06:09:41,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1742412.0, ans=0.125 2023-06-27 06:10:16,015 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.54 vs. limit=10.0 2023-06-27 06:10:31,663 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.565e+02 5.245e+02 8.616e+02 1.211e+03 4.191e+03, threshold=1.723e+03, percent-clipped=6.0 2023-06-27 06:10:44,516 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0 2023-06-27 06:11:14,845 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.82 vs. limit=15.0 2023-06-27 06:11:21,899 INFO [train.py:996] (2/4) Epoch 10, batch 16000, loss[loss=0.179, simple_loss=0.261, pruned_loss=0.04847, over 21132.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2857, pruned_loss=0.0644, over 4254715.88 frames. 
], batch size: 143, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:12:27,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1742892.0, ans=0.125 2023-06-27 06:12:34,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1742892.0, ans=0.125 2023-06-27 06:12:41,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1742952.0, ans=0.0 2023-06-27 06:13:10,613 INFO [train.py:996] (2/4) Epoch 10, batch 16050, loss[loss=0.2064, simple_loss=0.284, pruned_loss=0.06443, over 21352.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.287, pruned_loss=0.06305, over 4254264.73 frames. ], batch size: 144, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:13:12,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1743012.0, ans=0.125 2023-06-27 06:13:12,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1743012.0, ans=0.125 2023-06-27 06:13:14,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1743012.0, ans=0.125 2023-06-27 06:13:16,725 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.06 vs. limit=15.0 2023-06-27 06:13:28,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1743072.0, ans=0.0 2023-06-27 06:14:07,175 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.055e+02 6.829e+02 9.641e+02 1.432e+03 3.603e+03, threshold=1.928e+03, percent-clipped=16.0 2023-06-27 06:14:14,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1743192.0, ans=0.2 2023-06-27 06:14:34,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1743252.0, ans=0.0 2023-06-27 06:14:51,663 INFO [train.py:996] (2/4) Epoch 10, batch 16100, loss[loss=0.1836, simple_loss=0.2362, pruned_loss=0.06548, over 20724.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2894, pruned_loss=0.06359, over 4262461.18 frames. ], batch size: 607, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:14:52,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1743312.0, ans=0.125 2023-06-27 06:14:57,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1743312.0, ans=0.2 2023-06-27 06:15:04,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1743312.0, ans=0.0 2023-06-27 06:15:24,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1743432.0, ans=0.125 2023-06-27 06:15:29,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1743432.0, ans=0.125 2023-06-27 06:16:27,300 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.53 vs. 
limit=15.0 2023-06-27 06:16:27,672 INFO [train.py:996] (2/4) Epoch 10, batch 16150, loss[loss=0.2089, simple_loss=0.2903, pruned_loss=0.06378, over 21934.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.289, pruned_loss=0.06626, over 4264904.32 frames. ], batch size: 351, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:16:53,689 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=22.5 2023-06-27 06:17:05,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1743672.0, ans=10.0 2023-06-27 06:17:36,700 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.965e+02 5.965e+02 7.575e+02 1.164e+03 3.405e+03, threshold=1.515e+03, percent-clipped=4.0 2023-06-27 06:17:43,176 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-06-27 06:18:00,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1743852.0, ans=0.1 2023-06-27 06:18:27,504 INFO [train.py:996] (2/4) Epoch 10, batch 16200, loss[loss=0.3167, simple_loss=0.3711, pruned_loss=0.1312, over 21352.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2929, pruned_loss=0.06798, over 4271514.95 frames. ], batch size: 507, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:18:43,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1743972.0, ans=0.1 2023-06-27 06:18:43,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1743972.0, ans=0.0 2023-06-27 06:18:47,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1743972.0, ans=0.05 2023-06-27 06:20:13,806 INFO [train.py:996] (2/4) Epoch 10, batch 16250, loss[loss=0.2254, simple_loss=0.3018, pruned_loss=0.0745, over 21337.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2937, pruned_loss=0.06781, over 4275488.87 frames. ], batch size: 549, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:20:22,112 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=12.0 2023-06-27 06:21:10,959 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.705e+02 5.225e+02 6.820e+02 1.048e+03 2.777e+03, threshold=1.364e+03, percent-clipped=10.0 2023-06-27 06:21:56,487 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=15.0 2023-06-27 06:22:00,274 INFO [train.py:996] (2/4) Epoch 10, batch 16300, loss[loss=0.2156, simple_loss=0.3065, pruned_loss=0.06232, over 21430.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2891, pruned_loss=0.06406, over 4259566.35 frames. 
], batch size: 471, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:22:19,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1744572.0, ans=0.125 2023-06-27 06:23:09,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1744692.0, ans=0.0 2023-06-27 06:23:47,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1744812.0, ans=0.2 2023-06-27 06:23:48,234 INFO [train.py:996] (2/4) Epoch 10, batch 16350, loss[loss=0.2435, simple_loss=0.3218, pruned_loss=0.08264, over 21950.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2897, pruned_loss=0.06538, over 4272124.40 frames. ], batch size: 372, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:23:50,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1744812.0, ans=0.125 2023-06-27 06:24:23,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1744872.0, ans=0.2 2023-06-27 06:24:30,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1744872.0, ans=0.2 2023-06-27 06:24:39,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1744932.0, ans=0.1 2023-06-27 06:24:45,632 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.899e+02 6.082e+02 8.252e+02 1.130e+03 2.497e+03, threshold=1.650e+03, percent-clipped=10.0 2023-06-27 06:24:51,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1744992.0, ans=0.0 2023-06-27 06:24:51,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1744992.0, ans=0.0 2023-06-27 06:25:00,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1744992.0, ans=0.125 2023-06-27 06:25:19,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1745052.0, ans=0.1 2023-06-27 06:25:29,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1745052.0, ans=0.1 2023-06-27 06:25:35,471 INFO [train.py:996] (2/4) Epoch 10, batch 16400, loss[loss=0.2051, simple_loss=0.2735, pruned_loss=0.0683, over 21525.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2938, pruned_loss=0.06661, over 4271052.99 frames. ], batch size: 211, lr: 2.93e-03, grad_scale: 32.0 2023-06-27 06:25:48,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1745112.0, ans=0.2 2023-06-27 06:25:59,244 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.36 vs. limit=15.0 2023-06-27 06:27:14,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1745352.0, ans=0.125 2023-06-27 06:27:19,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1745352.0, ans=0.0 2023-06-27 06:27:22,323 INFO [train.py:996] (2/4) Epoch 10, batch 16450, loss[loss=0.2284, simple_loss=0.2893, pruned_loss=0.0838, over 20008.00 frames. 
], tot_loss[loss=0.2151, simple_loss=0.294, pruned_loss=0.06808, over 4286566.09 frames. ], batch size: 702, lr: 2.93e-03, grad_scale: 32.0 2023-06-27 06:27:44,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1745412.0, ans=0.1 2023-06-27 06:28:04,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1745472.0, ans=0.0 2023-06-27 06:28:19,530 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.272e+02 6.597e+02 9.235e+02 1.601e+03 3.322e+03, threshold=1.847e+03, percent-clipped=22.0 2023-06-27 06:29:07,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1745652.0, ans=0.2 2023-06-27 06:29:15,241 INFO [train.py:996] (2/4) Epoch 10, batch 16500, loss[loss=0.2188, simple_loss=0.2962, pruned_loss=0.0707, over 21779.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2927, pruned_loss=0.06845, over 4280945.81 frames. ], batch size: 332, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:29:20,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1745712.0, ans=0.0 2023-06-27 06:30:33,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1745892.0, ans=0.2 2023-06-27 06:31:10,045 INFO [train.py:996] (2/4) Epoch 10, batch 16550, loss[loss=0.2095, simple_loss=0.2862, pruned_loss=0.06644, over 21779.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2885, pruned_loss=0.06634, over 4275843.06 frames. ], batch size: 247, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:32:11,788 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.278e+02 6.354e+02 1.023e+03 1.715e+03 3.969e+03, threshold=2.045e+03, percent-clipped=20.0 2023-06-27 06:32:25,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1746192.0, ans=0.125 2023-06-27 06:32:28,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1746192.0, ans=0.125 2023-06-27 06:32:54,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1746252.0, ans=0.0 2023-06-27 06:33:01,727 INFO [train.py:996] (2/4) Epoch 10, batch 16600, loss[loss=0.2856, simple_loss=0.3867, pruned_loss=0.09224, over 21662.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2956, pruned_loss=0.06836, over 4275274.86 frames. ], batch size: 441, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:33:45,711 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-06-27 06:33:46,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1746432.0, ans=0.1 2023-06-27 06:33:52,430 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.06 vs. 
limit=22.5 2023-06-27 06:34:05,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1746432.0, ans=0.05 2023-06-27 06:34:24,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1746492.0, ans=0.0 2023-06-27 06:34:35,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1746552.0, ans=0.0 2023-06-27 06:34:50,937 INFO [train.py:996] (2/4) Epoch 10, batch 16650, loss[loss=0.2234, simple_loss=0.3084, pruned_loss=0.06923, over 21710.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.304, pruned_loss=0.07026, over 4280223.96 frames. ], batch size: 332, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:35:09,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1746612.0, ans=0.125 2023-06-27 06:35:53,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1746732.0, ans=0.125 2023-06-27 06:35:58,145 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.748e+02 7.097e+02 9.518e+02 1.581e+03 3.619e+03, threshold=1.904e+03, percent-clipped=14.0 2023-06-27 06:36:20,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1746852.0, ans=0.2 2023-06-27 06:36:48,637 INFO [train.py:996] (2/4) Epoch 10, batch 16700, loss[loss=0.2207, simple_loss=0.3127, pruned_loss=0.06434, over 21046.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3057, pruned_loss=0.07135, over 4277853.67 frames. ], batch size: 607, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:36:54,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1746912.0, ans=0.125 2023-06-27 06:36:58,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1746912.0, ans=0.1 2023-06-27 06:37:10,059 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 06:37:25,318 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=15.0 2023-06-27 06:37:51,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1747092.0, ans=0.0 2023-06-27 06:37:56,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1747092.0, ans=0.125 2023-06-27 06:38:30,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1747152.0, ans=0.125 2023-06-27 06:38:46,428 INFO [train.py:996] (2/4) Epoch 10, batch 16750, loss[loss=0.2371, simple_loss=0.3215, pruned_loss=0.07631, over 21800.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3071, pruned_loss=0.07282, over 4273565.57 frames. 
], batch size: 124, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:39:00,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1747212.0, ans=0.2 2023-06-27 06:39:02,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1747212.0, ans=0.1 2023-06-27 06:39:07,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1747212.0, ans=0.125 2023-06-27 06:39:17,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1747272.0, ans=0.07 2023-06-27 06:39:27,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1747332.0, ans=0.125 2023-06-27 06:39:53,246 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.731e+02 7.125e+02 1.124e+03 1.580e+03 3.763e+03, threshold=2.248e+03, percent-clipped=17.0 2023-06-27 06:40:40,771 INFO [train.py:996] (2/4) Epoch 10, batch 16800, loss[loss=0.2037, simple_loss=0.2817, pruned_loss=0.06289, over 21851.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3097, pruned_loss=0.07254, over 4266434.00 frames. ], batch size: 332, lr: 2.93e-03, grad_scale: 32.0 2023-06-27 06:41:12,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1747572.0, ans=0.1 2023-06-27 06:41:24,189 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-06-27 06:41:25,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1747632.0, ans=0.0 2023-06-27 06:42:26,711 INFO [train.py:996] (2/4) Epoch 10, batch 16850, loss[loss=0.2234, simple_loss=0.2869, pruned_loss=0.0799, over 21562.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3061, pruned_loss=0.07302, over 4273906.01 frames. ], batch size: 548, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:43:27,409 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.310e+02 6.690e+02 9.145e+02 1.519e+03 3.869e+03, threshold=1.829e+03, percent-clipped=12.0 2023-06-27 06:43:43,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1747992.0, ans=0.04949747468305833 2023-06-27 06:44:12,643 INFO [train.py:996] (2/4) Epoch 10, batch 16900, loss[loss=0.2079, simple_loss=0.2748, pruned_loss=0.07057, over 21243.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3017, pruned_loss=0.07147, over 4272438.08 frames. ], batch size: 143, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:44:16,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1748112.0, ans=0.0 2023-06-27 06:44:37,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1748172.0, ans=0.125 2023-06-27 06:45:47,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1748352.0, ans=0.0 2023-06-27 06:45:59,667 INFO [train.py:996] (2/4) Epoch 10, batch 16950, loss[loss=0.1967, simple_loss=0.2709, pruned_loss=0.06121, over 21667.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2968, pruned_loss=0.07043, over 4267417.59 frames. 
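Aside: the recurring `optim.py:471` entries above report, for each logging interval, the gradient-norm quartiles, the clipping threshold, and the percentage of batches clipped. A minimal, stdlib-only Python sketch for pulling those numbers out of a log in this format is given below; the regex is written from the lines visible in this file (it is not an icefall/k2 utility), it assumes each clipping record sits on one physical line, and `train.log` is a placeholder path.

```python
import re
from typing import Iterator, List, Tuple

# Matches the optim.py clipping records in this log, e.g.
# "Clipping_scale=2.0, grad-norm quartiles 4.731e+02 7.125e+02 1.124e+03 1.580e+03 3.763e+03,
#  threshold=2.248e+03, percent-clipped=17.0"
CLIP_RE = re.compile(
    r"grad-norm quartiles\s+((?:[\d.]+e[+-]\d+\s*){5}),\s*"
    r"threshold=([\d.]+e[+-]\d+),\s*percent-clipped=([\d.]+)"
)

def clipping_stats(log_path: str) -> Iterator[Tuple[List[float], float, float]]:
    """Yield (quartiles, threshold, percent_clipped) for each clipping record."""
    with open(log_path, encoding="utf-8") as f:
        for line in f:  # assumes each clipping record sits on one physical line
            m = CLIP_RE.search(line)
            if m:
                quartiles = [float(x) for x in m.group(1).split()]
                yield quartiles, float(m.group(2)), float(m.group(3))

if __name__ == "__main__":
    # "train.log" is a placeholder path standing in for a log like this one.
    for q, thr, pct in clipping_stats("train.log"):
        print(f"median grad-norm {q[2]:.3e}  threshold {thr:.3e}  clipped {pct:.1f}%")
```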
], batch size: 263, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:46:47,517 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.95 vs. limit=15.0 2023-06-27 06:47:00,165 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.671e+02 6.123e+02 1.009e+03 1.392e+03 3.065e+03, threshold=2.019e+03, percent-clipped=11.0 2023-06-27 06:47:12,956 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.13 vs. limit=12.0 2023-06-27 06:47:47,025 INFO [train.py:996] (2/4) Epoch 10, batch 17000, loss[loss=0.2172, simple_loss=0.2887, pruned_loss=0.07286, over 21581.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2941, pruned_loss=0.07085, over 4277254.43 frames. ], batch size: 212, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:47:57,145 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.34 vs. limit=15.0 2023-06-27 06:47:58,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1748712.0, ans=0.125 2023-06-27 06:48:03,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1748772.0, ans=0.0 2023-06-27 06:48:42,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1748832.0, ans=0.125 2023-06-27 06:49:32,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1748952.0, ans=0.2 2023-06-27 06:49:35,384 INFO [train.py:996] (2/4) Epoch 10, batch 17050, loss[loss=0.2266, simple_loss=0.3133, pruned_loss=0.06993, over 21813.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3024, pruned_loss=0.07303, over 4281064.37 frames. ], batch size: 351, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:50:23,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1749132.0, ans=0.1 2023-06-27 06:50:39,797 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.381e+02 7.817e+02 1.217e+03 1.816e+03 4.089e+03, threshold=2.434e+03, percent-clipped=19.0 2023-06-27 06:51:12,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1749252.0, ans=0.125 2023-06-27 06:51:20,972 INFO [train.py:996] (2/4) Epoch 10, batch 17100, loss[loss=0.2089, simple_loss=0.286, pruned_loss=0.06588, over 21457.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3016, pruned_loss=0.07357, over 4281703.24 frames. 
], batch size: 131, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:51:21,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1749312.0, ans=0.07 2023-06-27 06:51:47,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1749372.0, ans=0.125 2023-06-27 06:52:00,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1749372.0, ans=0.0 2023-06-27 06:52:20,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1749432.0, ans=0.125 2023-06-27 06:52:29,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1749492.0, ans=0.125 2023-06-27 06:52:33,133 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.66 vs. limit=15.0 2023-06-27 06:52:37,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1749492.0, ans=0.0 2023-06-27 06:53:00,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1749552.0, ans=0.125 2023-06-27 06:53:07,979 INFO [train.py:996] (2/4) Epoch 10, batch 17150, loss[loss=0.1946, simple_loss=0.2713, pruned_loss=0.05892, over 21814.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2972, pruned_loss=0.07303, over 4278396.35 frames. ], batch size: 118, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:54:16,436 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.998e+02 6.290e+02 9.886e+02 1.236e+03 2.278e+03, threshold=1.977e+03, percent-clipped=0.0 2023-06-27 06:54:34,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1749792.0, ans=0.2 2023-06-27 06:55:01,788 INFO [train.py:996] (2/4) Epoch 10, batch 17200, loss[loss=0.2203, simple_loss=0.2938, pruned_loss=0.07336, over 21774.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2972, pruned_loss=0.07307, over 4281138.46 frames. ], batch size: 247, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:55:36,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1749972.0, ans=0.0 2023-06-27 06:56:20,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1750092.0, ans=0.125 2023-06-27 06:56:26,546 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.41 vs. limit=22.5 2023-06-27 06:56:46,490 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 06:56:56,967 INFO [train.py:996] (2/4) Epoch 10, batch 17250, loss[loss=0.2398, simple_loss=0.3215, pruned_loss=0.07903, over 21596.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3, pruned_loss=0.07492, over 4283311.26 frames. ], batch size: 389, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:57:03,301 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.16 vs. 
limit=12.0 2023-06-27 06:57:05,058 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=22.5 2023-06-27 06:57:34,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1750272.0, ans=0.125 2023-06-27 06:57:44,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1750332.0, ans=0.05 2023-06-27 06:58:00,060 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.260e+02 7.026e+02 1.059e+03 1.492e+03 2.502e+03, threshold=2.118e+03, percent-clipped=5.0 2023-06-27 06:58:50,682 INFO [train.py:996] (2/4) Epoch 10, batch 17300, loss[loss=0.2517, simple_loss=0.3225, pruned_loss=0.09049, over 21315.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3061, pruned_loss=0.07712, over 4280870.66 frames. ], batch size: 549, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 06:58:54,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1750512.0, ans=0.125 2023-06-27 06:59:53,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1750692.0, ans=0.125 2023-06-27 06:59:55,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1750692.0, ans=0.0 2023-06-27 07:00:03,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1750692.0, ans=0.2 2023-06-27 07:00:39,977 INFO [train.py:996] (2/4) Epoch 10, batch 17350, loss[loss=0.1981, simple_loss=0.2835, pruned_loss=0.05636, over 21741.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3073, pruned_loss=0.07673, over 4277089.82 frames. ], batch size: 247, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:00:48,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1750812.0, ans=0.125 2023-06-27 07:00:53,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1750812.0, ans=0.125 2023-06-27 07:00:56,105 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.49 vs. limit=15.0 2023-06-27 07:00:57,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1750812.0, ans=15.0 2023-06-27 07:01:43,505 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.572e+02 6.288e+02 8.971e+02 1.269e+03 2.386e+03, threshold=1.794e+03, percent-clipped=3.0 2023-06-27 07:01:47,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1750992.0, ans=0.0 2023-06-27 07:02:35,914 INFO [train.py:996] (2/4) Epoch 10, batch 17400, loss[loss=0.2719, simple_loss=0.3527, pruned_loss=0.09555, over 21454.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3039, pruned_loss=0.0729, over 4278921.21 frames. 
], batch size: 471, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:02:42,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1751112.0, ans=0.125 2023-06-27 07:03:03,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1751172.0, ans=0.125 2023-06-27 07:03:14,267 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.08 vs. limit=15.0 2023-06-27 07:03:16,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1751232.0, ans=0.1 2023-06-27 07:04:13,315 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 07:04:17,113 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-27 07:04:24,546 INFO [train.py:996] (2/4) Epoch 10, batch 17450, loss[loss=0.1984, simple_loss=0.2949, pruned_loss=0.05094, over 21872.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3006, pruned_loss=0.0706, over 4278026.66 frames. ], batch size: 373, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:04:33,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1751412.0, ans=0.125 2023-06-27 07:04:35,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1751412.0, ans=0.0 2023-06-27 07:05:31,487 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.811e+02 5.744e+02 7.670e+02 1.157e+03 3.080e+03, threshold=1.534e+03, percent-clipped=10.0 2023-06-27 07:05:43,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1751592.0, ans=0.0 2023-06-27 07:05:55,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1751652.0, ans=0.2 2023-06-27 07:06:11,719 INFO [train.py:996] (2/4) Epoch 10, batch 17500, loss[loss=0.2227, simple_loss=0.3007, pruned_loss=0.07229, over 21853.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2965, pruned_loss=0.06912, over 4281854.65 frames. ], batch size: 124, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:06:20,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1751712.0, ans=0.1 2023-06-27 07:06:24,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1751712.0, ans=0.125 2023-06-27 07:06:33,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1751772.0, ans=0.125 2023-06-27 07:06:55,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1751832.0, ans=0.0 2023-06-27 07:07:03,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1751832.0, ans=0.125 2023-06-27 07:07:59,014 INFO [train.py:996] (2/4) Epoch 10, batch 17550, loss[loss=0.2078, simple_loss=0.2999, pruned_loss=0.05792, over 21698.00 frames. 
], tot_loss[loss=0.2159, simple_loss=0.2962, pruned_loss=0.06781, over 4269731.13 frames. ], batch size: 247, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:08:13,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1752012.0, ans=0.125 2023-06-27 07:08:23,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1752072.0, ans=0.125 2023-06-27 07:08:55,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1752132.0, ans=0.125 2023-06-27 07:09:08,411 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.979e+02 5.513e+02 7.220e+02 1.144e+03 2.854e+03, threshold=1.444e+03, percent-clipped=10.0 2023-06-27 07:09:26,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1752192.0, ans=0.2 2023-06-27 07:09:48,128 INFO [train.py:996] (2/4) Epoch 10, batch 17600, loss[loss=0.2182, simple_loss=0.289, pruned_loss=0.07371, over 21617.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2991, pruned_loss=0.06876, over 4267177.88 frames. ], batch size: 263, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:10:06,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1752312.0, ans=0.125 2023-06-27 07:10:52,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1752432.0, ans=0.125 2023-06-27 07:10:53,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1752432.0, ans=0.05 2023-06-27 07:11:30,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1752552.0, ans=0.09899494936611666 2023-06-27 07:11:36,220 INFO [train.py:996] (2/4) Epoch 10, batch 17650, loss[loss=0.2342, simple_loss=0.3185, pruned_loss=0.07493, over 21278.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2976, pruned_loss=0.06856, over 4261529.69 frames. 
], batch size: 143, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:11:54,023 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 07:12:13,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1752672.0, ans=0.05 2023-06-27 07:12:15,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1752672.0, ans=0.1 2023-06-27 07:12:19,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1752732.0, ans=0.125 2023-06-27 07:12:29,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1752732.0, ans=0.1 2023-06-27 07:12:51,344 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.366e+02 6.949e+02 1.125e+03 1.736e+03 3.581e+03, threshold=2.249e+03, percent-clipped=33.0 2023-06-27 07:12:52,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1752792.0, ans=0.1 2023-06-27 07:12:57,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1752792.0, ans=0.1 2023-06-27 07:13:23,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1752852.0, ans=0.125 2023-06-27 07:13:30,148 INFO [train.py:996] (2/4) Epoch 10, batch 17700, loss[loss=0.2365, simple_loss=0.3467, pruned_loss=0.06318, over 20794.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2942, pruned_loss=0.06634, over 4257040.46 frames. ], batch size: 608, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:14:34,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=1753032.0, ans=0.05 2023-06-27 07:14:54,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1753092.0, ans=0.125 2023-06-27 07:14:55,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1753152.0, ans=0.2 2023-06-27 07:15:24,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1753212.0, ans=0.125 2023-06-27 07:15:25,784 INFO [train.py:996] (2/4) Epoch 10, batch 17750, loss[loss=0.1737, simple_loss=0.2823, pruned_loss=0.03258, over 20651.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.3011, pruned_loss=0.06941, over 4260958.36 frames. ], batch size: 608, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:16:14,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1753332.0, ans=0.1 2023-06-27 07:16:26,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1753332.0, ans=0.125 2023-06-27 07:16:31,249 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.520e+02 6.307e+02 8.574e+02 1.258e+03 1.929e+03, threshold=1.715e+03, percent-clipped=0.0 2023-06-27 07:17:15,970 INFO [train.py:996] (2/4) Epoch 10, batch 17800, loss[loss=0.1915, simple_loss=0.2721, pruned_loss=0.05545, over 21587.00 frames. 
], tot_loss[loss=0.2184, simple_loss=0.2998, pruned_loss=0.06845, over 4264109.91 frames. ], batch size: 230, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:18:07,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1753632.0, ans=0.125 2023-06-27 07:18:28,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1753692.0, ans=0.2 2023-06-27 07:19:09,875 INFO [train.py:996] (2/4) Epoch 10, batch 17850, loss[loss=0.2333, simple_loss=0.3132, pruned_loss=0.07676, over 21839.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.3006, pruned_loss=0.06915, over 4271897.36 frames. ], batch size: 124, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:19:17,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1753812.0, ans=0.125 2023-06-27 07:19:38,228 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 07:20:19,506 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.075e+02 5.767e+02 7.778e+02 1.051e+03 2.491e+03, threshold=1.556e+03, percent-clipped=2.0 2023-06-27 07:20:27,190 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 07:20:47,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1754052.0, ans=0.125 2023-06-27 07:20:59,126 INFO [train.py:996] (2/4) Epoch 10, batch 17900, loss[loss=0.2517, simple_loss=0.3504, pruned_loss=0.07651, over 21696.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.306, pruned_loss=0.07069, over 4266938.90 frames. ], batch size: 441, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:21:12,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1754112.0, ans=0.125 2023-06-27 07:21:19,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1754112.0, ans=0.0 2023-06-27 07:21:28,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1754172.0, ans=0.125 2023-06-27 07:21:33,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1754172.0, ans=0.125 2023-06-27 07:22:19,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1754292.0, ans=0.125 2023-06-27 07:22:48,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1754352.0, ans=0.0 2023-06-27 07:22:54,518 INFO [train.py:996] (2/4) Epoch 10, batch 17950, loss[loss=0.1809, simple_loss=0.2719, pruned_loss=0.04499, over 21360.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.305, pruned_loss=0.06737, over 4264539.36 frames. ], batch size: 211, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:23:57,658 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.559e+02 6.955e+02 1.067e+03 1.323e+03 3.422e+03, threshold=2.134e+03, percent-clipped=13.0 2023-06-27 07:24:41,319 INFO [train.py:996] (2/4) Epoch 10, batch 18000, loss[loss=0.1923, simple_loss=0.251, pruned_loss=0.06679, over 21516.00 frames. 
], tot_loss[loss=0.215, simple_loss=0.2979, pruned_loss=0.06601, over 4262732.32 frames. ], batch size: 212, lr: 2.92e-03, grad_scale: 32.0 2023-06-27 07:24:41,320 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-27 07:24:53,835 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.9238, 4.1986, 3.9889, 4.2662], device='cuda:2') 2023-06-27 07:24:59,835 INFO [train.py:1028] (2/4) Epoch 10, validation: loss=0.2583, simple_loss=0.3514, pruned_loss=0.08255, over 1796401.00 frames. 2023-06-27 07:24:59,836 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-27 07:25:40,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1754772.0, ans=0.2 2023-06-27 07:26:24,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1754892.0, ans=0.0 2023-06-27 07:26:48,129 INFO [train.py:996] (2/4) Epoch 10, batch 18050, loss[loss=0.185, simple_loss=0.2648, pruned_loss=0.05256, over 21490.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2918, pruned_loss=0.06531, over 4254929.37 frames. ], batch size: 389, lr: 2.92e-03, grad_scale: 32.0 2023-06-27 07:27:03,400 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.64 vs. limit=15.0 2023-06-27 07:27:41,228 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=15.0 2023-06-27 07:27:54,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1755132.0, ans=0.2 2023-06-27 07:28:06,133 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.888e+02 5.313e+02 7.169e+02 9.498e+02 2.481e+03, threshold=1.434e+03, percent-clipped=3.0 2023-06-27 07:28:15,578 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 07:28:37,175 INFO [train.py:996] (2/4) Epoch 10, batch 18100, loss[loss=0.2338, simple_loss=0.306, pruned_loss=0.08074, over 21820.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2956, pruned_loss=0.06769, over 4264908.88 frames. ], batch size: 102, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:29:09,240 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 07:29:52,258 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.14 vs. limit=6.0 2023-06-27 07:29:55,310 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=22.5 2023-06-27 07:30:18,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1755552.0, ans=0.125 2023-06-27 07:30:24,628 INFO [train.py:996] (2/4) Epoch 10, batch 18150, loss[loss=0.1935, simple_loss=0.2643, pruned_loss=0.06135, over 21250.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2973, pruned_loss=0.06698, over 4261569.81 frames. 
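Aside: the `train.py:996` and `train.py:1028` entries above carry the running per-batch `tot_loss` and the periodic validation loss. A small Python sketch (again not part of icefall; the patterns simply mirror the wording of this log, and `train.log` is a placeholder path) that recovers those values so a loss curve can be tabulated or plotted:

```python
import re

# Per-batch training summaries in this log, e.g.
# "Epoch 10, batch 18000, loss[...], tot_loss[loss=0.215, simple_loss=0.2979, pruned_loss=0.06601, ...]"
TRAIN_RE = re.compile(
    r"Epoch (\d+), batch (\d+),.*?tot_loss\[loss=([\d.]+), simple_loss=([\d.]+), pruned_loss=([\d.]+)",
    re.DOTALL,  # a single record can wrap across physical lines, as in this file
)
# Validation summaries, e.g. "Epoch 10, validation: loss=0.2583, simple_loss=0.3514, pruned_loss=0.08255"
VALID_RE = re.compile(
    r"Epoch (\d+), validation: loss=([\d.]+), simple_loss=([\d.]+), pruned_loss=([\d.]+)"
)

def loss_curves(log_path: str):
    """Parse an icefall-style training log into train and validation loss points."""
    with open(log_path, encoding="utf-8") as f:
        text = f.read()
    train = [(int(m.group(1)), int(m.group(2)), float(m.group(3)))
             for m in TRAIN_RE.finditer(text)]
    valid = [(int(m.group(1)), float(m.group(2)))
             for m in VALID_RE.finditer(text)]
    return train, valid

if __name__ == "__main__":
    train, valid = loss_curves("train.log")  # placeholder path
    if train:
        ep, b, loss = train[-1]
        print(f"last train tot_loss: epoch {ep}, batch {b}, loss {loss}")
    if valid:
        print(f"last validation loss: epoch {valid[-1][0]}, loss {valid[-1][1]}")
```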
], batch size: 159, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:30:43,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1755612.0, ans=0.125 2023-06-27 07:30:55,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1755672.0, ans=0.0 2023-06-27 07:31:05,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1755672.0, ans=0.125 2023-06-27 07:31:23,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1755732.0, ans=0.125 2023-06-27 07:31:42,084 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.282e+02 6.043e+02 8.888e+02 1.339e+03 2.734e+03, threshold=1.778e+03, percent-clipped=20.0 2023-06-27 07:32:10,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1755912.0, ans=0.125 2023-06-27 07:32:11,833 INFO [train.py:996] (2/4) Epoch 10, batch 18200, loss[loss=0.177, simple_loss=0.2476, pruned_loss=0.05315, over 21298.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.292, pruned_loss=0.0678, over 4264392.76 frames. ], batch size: 551, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:32:54,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1756032.0, ans=0.125 2023-06-27 07:32:58,355 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-06-27 07:33:23,640 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=15.0 2023-06-27 07:33:47,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1756152.0, ans=0.125 2023-06-27 07:33:57,135 INFO [train.py:996] (2/4) Epoch 10, batch 18250, loss[loss=0.1786, simple_loss=0.2497, pruned_loss=0.05374, over 21821.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2843, pruned_loss=0.0655, over 4269004.72 frames. 
], batch size: 102, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:33:58,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1756212.0, ans=0.2 2023-06-27 07:34:17,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1756272.0, ans=0.125 2023-06-27 07:35:02,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1756392.0, ans=0.0 2023-06-27 07:35:06,627 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.839e+02 5.360e+02 7.214e+02 1.131e+03 2.943e+03, threshold=1.443e+03, percent-clipped=6.0 2023-06-27 07:35:13,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1756392.0, ans=0.125 2023-06-27 07:35:18,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1756452.0, ans=0.5 2023-06-27 07:35:40,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1756512.0, ans=0.125 2023-06-27 07:35:41,590 INFO [train.py:996] (2/4) Epoch 10, batch 18300, loss[loss=0.2253, simple_loss=0.3309, pruned_loss=0.05992, over 21806.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2844, pruned_loss=0.06511, over 4276912.45 frames. ], batch size: 351, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:35:44,448 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.68 vs. limit=8.0 2023-06-27 07:35:46,018 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-27 07:36:35,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1756632.0, ans=0.0 2023-06-27 07:37:27,284 INFO [train.py:996] (2/4) Epoch 10, batch 18350, loss[loss=0.206, simple_loss=0.2736, pruned_loss=0.06921, over 21191.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2895, pruned_loss=0.06472, over 4264183.93 frames. ], batch size: 176, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:37:31,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1756812.0, ans=0.125 2023-06-27 07:38:39,348 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.953e+02 5.887e+02 8.763e+02 1.316e+03 3.037e+03, threshold=1.753e+03, percent-clipped=16.0 2023-06-27 07:38:41,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1756992.0, ans=0.125 2023-06-27 07:39:08,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1757052.0, ans=0.0 2023-06-27 07:39:13,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1757052.0, ans=0.2 2023-06-27 07:39:16,525 INFO [train.py:996] (2/4) Epoch 10, batch 18400, loss[loss=0.2068, simple_loss=0.2642, pruned_loss=0.0747, over 20219.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2859, pruned_loss=0.06422, over 4260088.97 frames. 
], batch size: 703, lr: 2.92e-03, grad_scale: 32.0 2023-06-27 07:39:26,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1757112.0, ans=0.04949747468305833 2023-06-27 07:40:03,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1757232.0, ans=0.05 2023-06-27 07:41:04,179 INFO [train.py:996] (2/4) Epoch 10, batch 18450, loss[loss=0.1773, simple_loss=0.2643, pruned_loss=0.04514, over 21663.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2814, pruned_loss=0.06117, over 4260640.60 frames. ], batch size: 247, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:42:17,468 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.327e+02 4.825e+02 6.029e+02 8.495e+02 1.994e+03, threshold=1.206e+03, percent-clipped=1.0 2023-06-27 07:42:50,200 INFO [train.py:996] (2/4) Epoch 10, batch 18500, loss[loss=0.2024, simple_loss=0.2898, pruned_loss=0.05747, over 21390.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2761, pruned_loss=0.05998, over 4259846.05 frames. ], batch size: 471, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:42:57,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1757712.0, ans=0.1 2023-06-27 07:43:08,879 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.52 vs. limit=15.0 2023-06-27 07:43:33,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1757832.0, ans=0.1 2023-06-27 07:43:33,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1757832.0, ans=0.07 2023-06-27 07:43:46,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1757832.0, ans=0.0 2023-06-27 07:44:08,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1757892.0, ans=0.0 2023-06-27 07:44:11,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1757892.0, ans=0.1 2023-06-27 07:44:37,148 INFO [train.py:996] (2/4) Epoch 10, batch 18550, loss[loss=0.1915, simple_loss=0.2565, pruned_loss=0.06328, over 21313.00 frames. ], tot_loss[loss=0.1967, simple_loss=0.2742, pruned_loss=0.05959, over 4251886.32 frames. ], batch size: 177, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:44:38,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1758012.0, ans=0.125 2023-06-27 07:45:00,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1758012.0, ans=0.125 2023-06-27 07:45:13,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1758072.0, ans=0.0 2023-06-27 07:45:53,603 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.26 vs. 
limit=22.5 2023-06-27 07:45:57,525 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.953e+02 6.338e+02 9.700e+02 1.484e+03 3.316e+03, threshold=1.940e+03, percent-clipped=34.0 2023-06-27 07:46:09,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1758252.0, ans=0.125 2023-06-27 07:46:24,949 INFO [train.py:996] (2/4) Epoch 10, batch 18600, loss[loss=0.1824, simple_loss=0.257, pruned_loss=0.05389, over 21206.00 frames. ], tot_loss[loss=0.1966, simple_loss=0.2727, pruned_loss=0.06028, over 4250675.63 frames. ], batch size: 159, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:46:29,663 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.70 vs. limit=22.5 2023-06-27 07:46:54,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1758372.0, ans=0.2 2023-06-27 07:47:19,939 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.47 vs. limit=15.0 2023-06-27 07:47:31,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1758432.0, ans=0.2 2023-06-27 07:48:09,086 INFO [train.py:996] (2/4) Epoch 10, batch 18650, loss[loss=0.2026, simple_loss=0.2604, pruned_loss=0.07235, over 21218.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2742, pruned_loss=0.06116, over 4249596.90 frames. ], batch size: 159, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:48:24,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1758612.0, ans=0.0 2023-06-27 07:48:38,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1758672.0, ans=0.0 2023-06-27 07:49:09,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1758732.0, ans=0.125 2023-06-27 07:49:21,085 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.715e+02 5.496e+02 8.127e+02 1.461e+03 3.115e+03, threshold=1.625e+03, percent-clipped=10.0 2023-06-27 07:49:28,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1758792.0, ans=0.125 2023-06-27 07:49:39,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1758852.0, ans=0.125 2023-06-27 07:49:53,351 INFO [train.py:996] (2/4) Epoch 10, batch 18700, loss[loss=0.213, simple_loss=0.2845, pruned_loss=0.07071, over 21373.00 frames. ], tot_loss[loss=0.1984, simple_loss=0.2724, pruned_loss=0.06222, over 4249596.58 frames. 
], batch size: 159, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:49:55,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1758912.0, ans=0.125 2023-06-27 07:51:01,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1759092.0, ans=0.1 2023-06-27 07:51:33,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1759152.0, ans=0.0 2023-06-27 07:51:36,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1759152.0, ans=0.125 2023-06-27 07:51:39,714 INFO [train.py:996] (2/4) Epoch 10, batch 18750, loss[loss=0.1819, simple_loss=0.2514, pruned_loss=0.05626, over 21610.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2747, pruned_loss=0.06431, over 4259169.13 frames. ], batch size: 212, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:51:55,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1759272.0, ans=0.2 2023-06-27 07:52:11,760 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 07:52:52,810 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.039e+02 6.322e+02 1.036e+03 1.574e+03 2.810e+03, threshold=2.072e+03, percent-clipped=23.0 2023-06-27 07:53:10,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1759452.0, ans=0.125 2023-06-27 07:53:20,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1759452.0, ans=0.125 2023-06-27 07:53:23,153 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.81 vs. limit=15.0 2023-06-27 07:53:25,201 INFO [train.py:996] (2/4) Epoch 10, batch 18800, loss[loss=0.1476, simple_loss=0.2251, pruned_loss=0.03508, over 21732.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2812, pruned_loss=0.06511, over 4269575.50 frames. ], batch size: 112, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:53:39,985 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0 2023-06-27 07:54:27,811 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=22.5 2023-06-27 07:54:31,003 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.96 vs. limit=15.0 2023-06-27 07:54:35,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1759692.0, ans=0.125 2023-06-27 07:55:04,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1759752.0, ans=0.2 2023-06-27 07:55:10,079 INFO [train.py:996] (2/4) Epoch 10, batch 18850, loss[loss=0.2249, simple_loss=0.2989, pruned_loss=0.07545, over 21835.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2778, pruned_loss=0.06127, over 4271606.86 frames. 
], batch size: 102, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:55:16,341 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=15.0 2023-06-27 07:56:19,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1759992.0, ans=0.125 2023-06-27 07:56:23,524 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.478e+02 5.405e+02 6.936e+02 9.507e+02 2.005e+03, threshold=1.387e+03, percent-clipped=0.0 2023-06-27 07:56:25,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1759992.0, ans=0.125 2023-06-27 07:56:56,194 INFO [train.py:996] (2/4) Epoch 10, batch 18900, loss[loss=0.2409, simple_loss=0.294, pruned_loss=0.09393, over 21518.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2747, pruned_loss=0.06237, over 4262405.82 frames. ], batch size: 473, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:57:44,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1760232.0, ans=0.0 2023-06-27 07:57:49,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1760232.0, ans=0.0 2023-06-27 07:58:28,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1760352.0, ans=0.05 2023-06-27 07:58:41,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1760412.0, ans=10.0 2023-06-27 07:58:42,087 INFO [train.py:996] (2/4) Epoch 10, batch 18950, loss[loss=0.1964, simple_loss=0.2709, pruned_loss=0.06096, over 21684.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.2761, pruned_loss=0.06428, over 4258016.63 frames. ], batch size: 263, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:59:16,605 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.77 vs. limit=22.5 2023-06-27 07:59:57,332 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.836e+02 7.333e+02 1.084e+03 1.694e+03 3.772e+03, threshold=2.167e+03, percent-clipped=36.0 2023-06-27 08:00:18,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1760652.0, ans=0.125 2023-06-27 08:00:24,681 INFO [train.py:996] (2/4) Epoch 10, batch 19000, loss[loss=0.252, simple_loss=0.3263, pruned_loss=0.08881, over 21261.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.286, pruned_loss=0.06586, over 4255119.32 frames. ], batch size: 159, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 08:02:06,205 INFO [train.py:996] (2/4) Epoch 10, batch 19050, loss[loss=0.2273, simple_loss=0.3051, pruned_loss=0.07475, over 21845.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2917, pruned_loss=0.06974, over 4266869.02 frames. ], batch size: 351, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 08:03:09,436 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.77 vs. 
limit=15.0 2023-06-27 08:03:10,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1761132.0, ans=0.0 2023-06-27 08:03:13,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1761132.0, ans=0.2 2023-06-27 08:03:24,546 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.013e+02 5.908e+02 6.994e+02 9.504e+02 2.053e+03, threshold=1.399e+03, percent-clipped=0.0 2023-06-27 08:03:35,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1761252.0, ans=0.125 2023-06-27 08:03:39,866 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.28 vs. limit=15.0 2023-06-27 08:03:46,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1761252.0, ans=0.125 2023-06-27 08:03:52,619 INFO [train.py:996] (2/4) Epoch 10, batch 19100, loss[loss=0.1786, simple_loss=0.2455, pruned_loss=0.05583, over 21618.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2899, pruned_loss=0.06913, over 4272493.13 frames. ], batch size: 247, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 08:04:19,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1761372.0, ans=0.05 2023-06-27 08:04:51,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1761432.0, ans=0.0 2023-06-27 08:04:58,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1761432.0, ans=0.1 2023-06-27 08:05:02,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1761492.0, ans=0.125 2023-06-27 08:05:10,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1761492.0, ans=0.125 2023-06-27 08:05:26,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1761552.0, ans=0.2 2023-06-27 08:05:42,415 INFO [train.py:996] (2/4) Epoch 10, batch 19150, loss[loss=0.1937, simple_loss=0.2614, pruned_loss=0.06304, over 21165.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.29, pruned_loss=0.06923, over 4276620.40 frames. ], batch size: 608, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 08:06:36,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1761732.0, ans=0.0 2023-06-27 08:06:53,111 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.207e+02 6.138e+02 1.014e+03 1.599e+03 3.928e+03, threshold=2.029e+03, percent-clipped=32.0 2023-06-27 08:07:02,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1761852.0, ans=0.125 2023-06-27 08:07:14,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1761852.0, ans=0.125 2023-06-27 08:07:26,294 INFO [train.py:996] (2/4) Epoch 10, batch 19200, loss[loss=0.2145, simple_loss=0.3185, pruned_loss=0.0552, over 21117.00 frames. 
], tot_loss[loss=0.2203, simple_loss=0.3009, pruned_loss=0.06982, over 4280533.96 frames. ], batch size: 143, lr: 2.92e-03, grad_scale: 32.0 2023-06-27 08:08:40,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1762092.0, ans=0.125 2023-06-27 08:08:48,066 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=12.0 2023-06-27 08:09:12,927 INFO [train.py:996] (2/4) Epoch 10, batch 19250, loss[loss=0.2203, simple_loss=0.2994, pruned_loss=0.07064, over 21615.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2994, pruned_loss=0.06539, over 4282865.35 frames. ], batch size: 471, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 08:09:17,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1762212.0, ans=0.125 2023-06-27 08:09:45,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1762272.0, ans=0.125 2023-06-27 08:09:52,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1762272.0, ans=0.125 2023-06-27 08:10:10,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1762332.0, ans=0.125 2023-06-27 08:10:23,404 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.711e+02 5.181e+02 6.655e+02 8.936e+02 1.845e+03, threshold=1.331e+03, percent-clipped=0.0 2023-06-27 08:10:59,778 INFO [train.py:996] (2/4) Epoch 10, batch 19300, loss[loss=0.188, simple_loss=0.2931, pruned_loss=0.04139, over 21277.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2956, pruned_loss=0.0646, over 4285110.08 frames. ], batch size: 548, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:11:09,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1762512.0, ans=0.125 2023-06-27 08:12:31,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1762752.0, ans=0.1 2023-06-27 08:12:31,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1762752.0, ans=0.125 2023-06-27 08:12:37,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1762752.0, ans=10.0 2023-06-27 08:12:52,826 INFO [train.py:996] (2/4) Epoch 10, batch 19350, loss[loss=0.1697, simple_loss=0.2579, pruned_loss=0.04078, over 21524.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2918, pruned_loss=0.06257, over 4274418.06 frames. ], batch size: 212, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:13:09,453 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.53 vs. 
limit=10.0 2023-06-27 08:13:10,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1762812.0, ans=0.1 2023-06-27 08:13:43,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1762932.0, ans=0.125 2023-06-27 08:13:53,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1762992.0, ans=0.2 2023-06-27 08:14:03,618 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.583e+02 5.604e+02 8.480e+02 1.112e+03 2.601e+03, threshold=1.696e+03, percent-clipped=20.0 2023-06-27 08:14:11,814 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-27 08:14:39,181 INFO [train.py:996] (2/4) Epoch 10, batch 19400, loss[loss=0.204, simple_loss=0.2762, pruned_loss=0.06586, over 21277.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2911, pruned_loss=0.06177, over 4280114.90 frames. ], batch size: 176, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:15:36,082 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=15.0 2023-06-27 08:15:38,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1763292.0, ans=0.05 2023-06-27 08:15:48,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1763292.0, ans=0.125 2023-06-27 08:15:57,686 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-06-27 08:16:23,183 INFO [train.py:996] (2/4) Epoch 10, batch 19450, loss[loss=0.1955, simple_loss=0.2565, pruned_loss=0.06721, over 21418.00 frames. ], tot_loss[loss=0.207, simple_loss=0.288, pruned_loss=0.06298, over 4282143.76 frames. ], batch size: 195, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:16:57,255 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 08:17:08,126 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.39 vs. limit=22.5 2023-06-27 08:17:34,302 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.14 vs. limit=15.0 2023-06-27 08:17:34,591 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.998e+02 5.344e+02 8.011e+02 1.240e+03 3.010e+03, threshold=1.602e+03, percent-clipped=14.0 2023-06-27 08:17:38,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1763592.0, ans=0.125 2023-06-27 08:18:11,412 INFO [train.py:996] (2/4) Epoch 10, batch 19500, loss[loss=0.2283, simple_loss=0.3061, pruned_loss=0.07527, over 21623.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2843, pruned_loss=0.0638, over 4271643.12 frames. 
], batch size: 414, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:18:22,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1763712.0, ans=0.0 2023-06-27 08:18:26,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1763712.0, ans=0.125 2023-06-27 08:19:28,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1763952.0, ans=0.2 2023-06-27 08:19:57,008 INFO [train.py:996] (2/4) Epoch 10, batch 19550, loss[loss=0.11, simple_loss=0.1559, pruned_loss=0.0321, over 17105.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2794, pruned_loss=0.06216, over 4270809.64 frames. ], batch size: 62, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:19:59,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1764012.0, ans=0.125 2023-06-27 08:20:09,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1764012.0, ans=0.2 2023-06-27 08:20:13,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1764072.0, ans=0.0 2023-06-27 08:20:17,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1764072.0, ans=0.1 2023-06-27 08:20:24,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1764072.0, ans=0.0 2023-06-27 08:20:36,748 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.06 vs. limit=15.0 2023-06-27 08:21:01,412 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.262e+02 6.617e+02 9.937e+02 1.346e+03 3.535e+03, threshold=1.987e+03, percent-clipped=18.0 2023-06-27 08:21:31,026 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 08:21:41,980 INFO [train.py:996] (2/4) Epoch 10, batch 19600, loss[loss=0.213, simple_loss=0.2865, pruned_loss=0.06971, over 21911.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2826, pruned_loss=0.06369, over 4276312.26 frames. ], batch size: 283, lr: 2.91e-03, grad_scale: 32.0 2023-06-27 08:21:59,131 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=12.0 2023-06-27 08:23:30,478 INFO [train.py:996] (2/4) Epoch 10, batch 19650, loss[loss=0.236, simple_loss=0.2996, pruned_loss=0.08616, over 21349.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.286, pruned_loss=0.06708, over 4282425.68 frames. 
], batch size: 176, lr: 2.91e-03, grad_scale: 32.0 2023-06-27 08:23:50,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1764672.0, ans=0.0 2023-06-27 08:24:23,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1764732.0, ans=0.125 2023-06-27 08:24:56,582 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.384e+02 6.290e+02 8.079e+02 1.063e+03 2.506e+03, threshold=1.616e+03, percent-clipped=1.0 2023-06-27 08:25:05,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1764852.0, ans=0.125 2023-06-27 08:25:22,638 INFO [train.py:996] (2/4) Epoch 10, batch 19700, loss[loss=0.1777, simple_loss=0.2571, pruned_loss=0.04913, over 21433.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2887, pruned_loss=0.06716, over 4274728.08 frames. ], batch size: 195, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:25:39,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1764972.0, ans=0.125 2023-06-27 08:26:06,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1764972.0, ans=0.1 2023-06-27 08:26:07,483 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.70 vs. limit=10.0 2023-06-27 08:26:38,884 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.42 vs. limit=15.0 2023-06-27 08:26:41,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1765092.0, ans=0.125 2023-06-27 08:27:06,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1765152.0, ans=0.125 2023-06-27 08:27:12,078 INFO [train.py:996] (2/4) Epoch 10, batch 19750, loss[loss=0.1976, simple_loss=0.2867, pruned_loss=0.05418, over 21305.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2972, pruned_loss=0.06837, over 4261076.52 frames. ], batch size: 176, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:28:01,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1765332.0, ans=0.0 2023-06-27 08:28:09,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1765332.0, ans=0.07 2023-06-27 08:28:33,897 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.148e+02 7.081e+02 1.435e+03 2.263e+03 4.438e+03, threshold=2.870e+03, percent-clipped=43.0 2023-06-27 08:28:58,243 INFO [train.py:996] (2/4) Epoch 10, batch 19800, loss[loss=0.2218, simple_loss=0.3016, pruned_loss=0.07099, over 21782.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2969, pruned_loss=0.06902, over 4258930.76 frames. 
], batch size: 414, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:29:11,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1765512.0, ans=0.2 2023-06-27 08:30:17,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1765692.0, ans=0.0 2023-06-27 08:30:48,434 INFO [train.py:996] (2/4) Epoch 10, batch 19850, loss[loss=0.2382, simple_loss=0.3311, pruned_loss=0.0727, over 21681.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2923, pruned_loss=0.06611, over 4260652.63 frames. ], batch size: 414, lr: 2.91e-03, grad_scale: 8.0 2023-06-27 08:32:13,173 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.649e+02 5.631e+02 8.956e+02 1.493e+03 4.041e+03, threshold=1.791e+03, percent-clipped=3.0 2023-06-27 08:32:29,833 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=15.0 2023-06-27 08:32:35,744 INFO [train.py:996] (2/4) Epoch 10, batch 19900, loss[loss=0.1789, simple_loss=0.2662, pruned_loss=0.04574, over 21630.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2911, pruned_loss=0.06334, over 4268824.35 frames. ], batch size: 247, lr: 2.91e-03, grad_scale: 8.0 2023-06-27 08:33:32,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1766232.0, ans=0.04949747468305833 2023-06-27 08:33:41,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1766232.0, ans=0.125 2023-06-27 08:33:44,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1766232.0, ans=0.125 2023-06-27 08:34:29,544 INFO [train.py:996] (2/4) Epoch 10, batch 19950, loss[loss=0.1821, simple_loss=0.2528, pruned_loss=0.05572, over 21753.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2847, pruned_loss=0.06278, over 4274133.41 frames. ], batch size: 118, lr: 2.91e-03, grad_scale: 8.0 2023-06-27 08:35:00,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1766412.0, ans=0.125 2023-06-27 08:35:14,562 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-27 08:35:34,639 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 08:35:48,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1766592.0, ans=0.125 2023-06-27 08:35:49,403 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.611e+02 4.941e+02 6.532e+02 1.016e+03 1.667e+03, threshold=1.306e+03, percent-clipped=0.0 2023-06-27 08:36:13,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1766652.0, ans=0.0 2023-06-27 08:36:21,401 INFO [train.py:996] (2/4) Epoch 10, batch 20000, loss[loss=0.2322, simple_loss=0.3063, pruned_loss=0.07905, over 21821.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2859, pruned_loss=0.06358, over 4273327.00 frames. 
], batch size: 371, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:37:04,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1766772.0, ans=0.1 2023-06-27 08:37:15,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1766832.0, ans=0.0 2023-06-27 08:37:18,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1766832.0, ans=0.125 2023-06-27 08:37:36,599 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-27 08:37:37,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1766892.0, ans=0.1 2023-06-27 08:37:52,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1766952.0, ans=0.0 2023-06-27 08:38:03,093 INFO [train.py:996] (2/4) Epoch 10, batch 20050, loss[loss=0.2146, simple_loss=0.2945, pruned_loss=0.06736, over 21286.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2878, pruned_loss=0.06572, over 4283438.01 frames. ], batch size: 176, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:38:17,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1767012.0, ans=0.125 2023-06-27 08:39:18,082 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.352e+02 5.758e+02 8.028e+02 1.108e+03 2.385e+03, threshold=1.606e+03, percent-clipped=14.0 2023-06-27 08:39:22,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1767252.0, ans=0.125 2023-06-27 08:39:25,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1767252.0, ans=15.0 2023-06-27 08:39:47,687 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-27 08:39:57,016 INFO [train.py:996] (2/4) Epoch 10, batch 20100, loss[loss=0.222, simple_loss=0.2989, pruned_loss=0.07253, over 21489.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2904, pruned_loss=0.06787, over 4289632.57 frames. ], batch size: 548, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:40:32,054 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 08:40:33,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1767432.0, ans=0.2 2023-06-27 08:40:39,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1767432.0, ans=0.125 2023-06-27 08:40:45,670 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 08:40:48,158 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.35 vs. limit=15.0 2023-06-27 08:41:26,569 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.56 vs. 
limit=15.0 2023-06-27 08:41:43,698 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=12.0 2023-06-27 08:41:44,177 INFO [train.py:996] (2/4) Epoch 10, batch 20150, loss[loss=0.2687, simple_loss=0.3435, pruned_loss=0.09692, over 21547.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2989, pruned_loss=0.07056, over 4294513.76 frames. ], batch size: 414, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:42:32,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1767732.0, ans=0.125 2023-06-27 08:43:08,922 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.695e+02 8.899e+02 1.364e+03 1.871e+03 4.503e+03, threshold=2.728e+03, percent-clipped=36.0 2023-06-27 08:43:14,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1767852.0, ans=0.125 2023-06-27 08:43:16,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1767852.0, ans=0.0 2023-06-27 08:43:31,391 INFO [train.py:996] (2/4) Epoch 10, batch 20200, loss[loss=0.2537, simple_loss=0.3516, pruned_loss=0.07794, over 20769.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3038, pruned_loss=0.0728, over 4290127.27 frames. ], batch size: 607, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:43:55,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1767972.0, ans=0.95 2023-06-27 08:44:01,304 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.42 vs. limit=22.5 2023-06-27 08:45:06,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1768152.0, ans=0.125 2023-06-27 08:45:19,238 INFO [train.py:996] (2/4) Epoch 10, batch 20250, loss[loss=0.1846, simple_loss=0.3032, pruned_loss=0.03299, over 19839.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3043, pruned_loss=0.07107, over 4285725.82 frames. ], batch size: 703, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:45:23,471 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 08:45:33,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1768212.0, ans=0.95 2023-06-27 08:45:40,884 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=22.5 2023-06-27 08:46:01,581 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=22.5 2023-06-27 08:46:37,988 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.153e+02 5.975e+02 7.847e+02 1.054e+03 2.189e+03, threshold=1.569e+03, percent-clipped=0.0 2023-06-27 08:46:56,046 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=22.5 2023-06-27 08:46:59,743 INFO [train.py:996] (2/4) Epoch 10, batch 20300, loss[loss=0.2096, simple_loss=0.2999, pruned_loss=0.05964, over 21762.00 frames. 
], tot_loss[loss=0.2201, simple_loss=0.3027, pruned_loss=0.06872, over 4281429.93 frames. ], batch size: 298, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:47:27,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1768572.0, ans=0.1 2023-06-27 08:47:58,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1768632.0, ans=0.125 2023-06-27 08:48:30,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1768752.0, ans=0.125 2023-06-27 08:48:40,547 INFO [train.py:996] (2/4) Epoch 10, batch 20350, loss[loss=0.2418, simple_loss=0.3192, pruned_loss=0.0822, over 21439.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.3036, pruned_loss=0.06946, over 4269953.02 frames. ], batch size: 131, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:49:04,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1768872.0, ans=0.2 2023-06-27 08:49:08,342 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-27 08:50:07,285 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.767e+02 5.725e+02 9.452e+02 1.415e+03 2.531e+03, threshold=1.890e+03, percent-clipped=19.0 2023-06-27 08:50:29,319 INFO [train.py:996] (2/4) Epoch 10, batch 20400, loss[loss=0.2407, simple_loss=0.3189, pruned_loss=0.08125, over 21787.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3049, pruned_loss=0.07155, over 4265577.68 frames. ], batch size: 298, lr: 2.91e-03, grad_scale: 32.0 2023-06-27 08:50:38,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1769112.0, ans=0.1 2023-06-27 08:50:40,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1769112.0, ans=0.1 2023-06-27 08:50:52,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1769172.0, ans=0.0 2023-06-27 08:50:59,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1769172.0, ans=0.125 2023-06-27 08:51:18,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1769172.0, ans=0.0 2023-06-27 08:51:41,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1769292.0, ans=0.125 2023-06-27 08:51:46,921 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 08:51:48,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1769292.0, ans=0.125 2023-06-27 08:51:54,077 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.29 vs. limit=15.0 2023-06-27 08:52:16,229 INFO [train.py:996] (2/4) Epoch 10, batch 20450, loss[loss=0.2133, simple_loss=0.2785, pruned_loss=0.07406, over 21715.00 frames. ], tot_loss[loss=0.227, simple_loss=0.307, pruned_loss=0.07346, over 4260032.64 frames. 
], batch size: 230, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:53:01,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1769532.0, ans=0.2 2023-06-27 08:53:42,097 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.405e+02 6.127e+02 7.186e+02 1.014e+03 1.873e+03, threshold=1.437e+03, percent-clipped=1.0 2023-06-27 08:54:02,055 INFO [train.py:996] (2/4) Epoch 10, batch 20500, loss[loss=0.2063, simple_loss=0.2733, pruned_loss=0.06972, over 21411.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3041, pruned_loss=0.07337, over 4259916.13 frames. ], batch size: 194, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:54:23,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1769772.0, ans=0.2 2023-06-27 08:55:08,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1769832.0, ans=0.0 2023-06-27 08:55:48,866 INFO [train.py:996] (2/4) Epoch 10, batch 20550, loss[loss=0.1931, simple_loss=0.2413, pruned_loss=0.07245, over 20238.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.295, pruned_loss=0.07118, over 4250767.17 frames. ], batch size: 703, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:56:54,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1770132.0, ans=0.0 2023-06-27 08:57:13,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1770192.0, ans=0.125 2023-06-27 08:57:14,711 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.809e+02 5.326e+02 8.786e+02 1.599e+03 3.543e+03, threshold=1.757e+03, percent-clipped=26.0 2023-06-27 08:57:27,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1770252.0, ans=0.2 2023-06-27 08:57:34,952 INFO [train.py:996] (2/4) Epoch 10, batch 20600, loss[loss=0.2301, simple_loss=0.3, pruned_loss=0.08004, over 21881.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.298, pruned_loss=0.07113, over 4242203.68 frames. ], batch size: 351, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:58:44,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1770492.0, ans=0.07 2023-06-27 08:59:19,960 INFO [train.py:996] (2/4) Epoch 10, batch 20650, loss[loss=0.1828, simple_loss=0.257, pruned_loss=0.05434, over 21867.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2931, pruned_loss=0.07089, over 4243920.96 frames. ], batch size: 98, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:59:22,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1770612.0, ans=0.125 2023-06-27 09:00:13,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1770732.0, ans=0.0 2023-06-27 09:00:24,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1770732.0, ans=6.0 2023-06-27 09:00:45,101 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.379e+02 5.610e+02 8.426e+02 1.372e+03 2.943e+03, threshold=1.685e+03, percent-clipped=16.0 2023-06-27 09:01:06,348 INFO [train.py:996] (2/4) Epoch 10, batch 20700, loss[loss=0.1857, simple_loss=0.2672, pruned_loss=0.05213, over 21789.00 frames. 
], tot_loss[loss=0.2108, simple_loss=0.2863, pruned_loss=0.06764, over 4252826.78 frames. ], batch size: 282, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:02:10,848 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 09:02:51,370 INFO [train.py:996] (2/4) Epoch 10, batch 20750, loss[loss=0.2358, simple_loss=0.327, pruned_loss=0.07235, over 21789.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2865, pruned_loss=0.06681, over 4238300.71 frames. ], batch size: 351, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:04:13,391 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.912e+02 7.283e+02 1.259e+03 1.890e+03 5.387e+03, threshold=2.519e+03, percent-clipped=32.0 2023-06-27 09:04:14,591 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.65 vs. limit=10.0 2023-06-27 09:04:39,277 INFO [train.py:996] (2/4) Epoch 10, batch 20800, loss[loss=0.1946, simple_loss=0.2587, pruned_loss=0.06526, over 21231.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2917, pruned_loss=0.0679, over 4245617.26 frames. ], batch size: 144, lr: 2.91e-03, grad_scale: 32.0 2023-06-27 09:05:52,367 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-06-27 09:06:02,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1771752.0, ans=0.0 2023-06-27 09:06:07,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1771752.0, ans=0.5 2023-06-27 09:06:17,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1771752.0, ans=0.2 2023-06-27 09:06:20,418 INFO [train.py:996] (2/4) Epoch 10, batch 20850, loss[loss=0.1737, simple_loss=0.2439, pruned_loss=0.05177, over 21642.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2866, pruned_loss=0.06616, over 4250556.00 frames. ], batch size: 230, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:07:11,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1771872.0, ans=0.1 2023-06-27 09:07:45,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1771992.0, ans=0.125 2023-06-27 09:07:48,085 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.047e+02 6.784e+02 1.035e+03 1.709e+03 3.199e+03, threshold=2.070e+03, percent-clipped=7.0 2023-06-27 09:07:59,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1772052.0, ans=0.1 2023-06-27 09:08:12,499 INFO [train.py:996] (2/4) Epoch 10, batch 20900, loss[loss=0.1816, simple_loss=0.2596, pruned_loss=0.05186, over 21772.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2855, pruned_loss=0.0655, over 4251539.83 frames. 
], batch size: 112, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:08:38,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1772112.0, ans=0.125 2023-06-27 09:08:40,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1772172.0, ans=0.0 2023-06-27 09:08:52,936 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.87 vs. limit=15.0 2023-06-27 09:09:13,254 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.97 vs. limit=15.0 2023-06-27 09:09:18,291 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.92 vs. limit=6.0 2023-06-27 09:09:54,332 INFO [train.py:996] (2/4) Epoch 10, batch 20950, loss[loss=0.1887, simple_loss=0.2683, pruned_loss=0.05454, over 21884.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.281, pruned_loss=0.06295, over 4248919.45 frames. ], batch size: 316, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:11:18,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1772592.0, ans=0.125 2023-06-27 09:11:19,712 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.843e+02 5.854e+02 8.072e+02 1.179e+03 2.171e+03, threshold=1.614e+03, percent-clipped=1.0 2023-06-27 09:11:38,286 INFO [train.py:996] (2/4) Epoch 10, batch 21000, loss[loss=0.2122, simple_loss=0.2891, pruned_loss=0.06768, over 21465.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.279, pruned_loss=0.06271, over 4261389.96 frames. ], batch size: 131, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:11:38,287 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-27 09:12:02,880 INFO [train.py:1028] (2/4) Epoch 10, validation: loss=0.2606, simple_loss=0.3545, pruned_loss=0.08334, over 1796401.00 frames. 2023-06-27 09:12:02,881 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-27 09:12:16,532 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0 2023-06-27 09:13:17,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1772892.0, ans=0.125 2023-06-27 09:13:37,379 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-27 09:13:42,563 INFO [train.py:996] (2/4) Epoch 10, batch 21050, loss[loss=0.2019, simple_loss=0.2752, pruned_loss=0.06427, over 21786.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2789, pruned_loss=0.06339, over 4264475.90 frames. 
], batch size: 371, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:13:43,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1773012.0, ans=0.1 2023-06-27 09:13:59,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1773012.0, ans=0.0 2023-06-27 09:14:25,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1773132.0, ans=0.125 2023-06-27 09:14:42,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1773192.0, ans=0.125 2023-06-27 09:14:59,147 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.065e+02 6.476e+02 8.191e+02 1.141e+03 2.345e+03, threshold=1.638e+03, percent-clipped=6.0 2023-06-27 09:15:07,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.whiten.whitening_limit, batch_count=1773252.0, ans=12.0 2023-06-27 09:15:23,614 INFO [train.py:996] (2/4) Epoch 10, batch 21100, loss[loss=0.2366, simple_loss=0.2751, pruned_loss=0.09908, over 21464.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2758, pruned_loss=0.06314, over 4254767.42 frames. ], batch size: 511, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:15:24,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1773312.0, ans=0.0 2023-06-27 09:15:59,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1773372.0, ans=0.0 2023-06-27 09:16:16,841 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.94 vs. limit=15.0 2023-06-27 09:16:46,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1773492.0, ans=0.0 2023-06-27 09:17:08,855 INFO [train.py:996] (2/4) Epoch 10, batch 21150, loss[loss=0.1841, simple_loss=0.2572, pruned_loss=0.05547, over 21681.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2717, pruned_loss=0.06334, over 4256580.70 frames. ], batch size: 333, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:17:41,387 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.99 vs. limit=15.0 2023-06-27 09:18:09,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1773732.0, ans=0.1 2023-06-27 09:18:20,251 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 09:18:33,090 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.032e+02 6.384e+02 8.623e+02 1.133e+03 2.526e+03, threshold=1.725e+03, percent-clipped=9.0 2023-06-27 09:18:35,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1773852.0, ans=0.1 2023-06-27 09:18:51,945 INFO [train.py:996] (2/4) Epoch 10, batch 21200, loss[loss=0.1681, simple_loss=0.2431, pruned_loss=0.04652, over 21407.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2687, pruned_loss=0.06298, over 4258616.81 frames. 
], batch size: 194, lr: 2.91e-03, grad_scale: 32.0 2023-06-27 09:19:55,433 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-06-27 09:20:03,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1774092.0, ans=0.1 2023-06-27 09:20:23,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1774152.0, ans=0.025 2023-06-27 09:20:35,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1774152.0, ans=0.125 2023-06-27 09:20:44,822 INFO [train.py:996] (2/4) Epoch 10, batch 21250, loss[loss=0.1996, simple_loss=0.2685, pruned_loss=0.06539, over 21816.00 frames. ], tot_loss[loss=0.196, simple_loss=0.2665, pruned_loss=0.06277, over 4262203.65 frames. ], batch size: 98, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:20:45,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1774212.0, ans=0.125 2023-06-27 09:20:47,846 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=12.0 2023-06-27 09:21:01,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1774212.0, ans=0.125 2023-06-27 09:21:05,412 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.43 vs. limit=15.0 2023-06-27 09:21:39,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1774332.0, ans=0.0 2023-06-27 09:21:52,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1774392.0, ans=0.1 2023-06-27 09:22:04,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1774392.0, ans=0.2 2023-06-27 09:22:08,474 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.773e+02 6.524e+02 9.029e+02 1.391e+03 2.253e+03, threshold=1.806e+03, percent-clipped=10.0 2023-06-27 09:22:12,991 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.49 vs. limit=10.0 2023-06-27 09:22:22,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1774452.0, ans=0.1 2023-06-27 09:22:25,326 INFO [train.py:996] (2/4) Epoch 10, batch 21300, loss[loss=0.1766, simple_loss=0.2441, pruned_loss=0.05456, over 16300.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2733, pruned_loss=0.06507, over 4261979.67 frames. ], batch size: 60, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:22:55,573 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.52 vs. 
limit=22.5 2023-06-27 09:23:19,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1774632.0, ans=0.125 2023-06-27 09:23:32,506 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.94 vs. limit=15.0 2023-06-27 09:24:06,989 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 09:24:13,071 INFO [train.py:996] (2/4) Epoch 10, batch 21350, loss[loss=0.2285, simple_loss=0.3141, pruned_loss=0.07151, over 21636.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2785, pruned_loss=0.06632, over 4275164.94 frames. ], batch size: 441, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:25:04,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1774932.0, ans=0.0 2023-06-27 09:25:05,102 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.06 vs. limit=22.5 2023-06-27 09:25:09,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1774932.0, ans=0.015 2023-06-27 09:25:11,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=1774932.0, ans=0.2 2023-06-27 09:25:39,161 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.868e+02 6.492e+02 8.824e+02 1.457e+03 2.432e+03, threshold=1.765e+03, percent-clipped=7.0 2023-06-27 09:25:40,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1775052.0, ans=0.125 2023-06-27 09:26:01,056 INFO [train.py:996] (2/4) Epoch 10, batch 21400, loss[loss=0.2498, simple_loss=0.3287, pruned_loss=0.08543, over 21927.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2804, pruned_loss=0.06545, over 4278355.23 frames. ], batch size: 372, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:26:23,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1775112.0, ans=0.07 2023-06-27 09:27:05,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1775292.0, ans=0.0 2023-06-27 09:27:37,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1775352.0, ans=0.125 2023-06-27 09:27:47,919 INFO [train.py:996] (2/4) Epoch 10, batch 21450, loss[loss=0.246, simple_loss=0.3079, pruned_loss=0.09209, over 21709.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2848, pruned_loss=0.06729, over 4285169.92 frames. ], batch size: 473, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:28:14,302 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.38 vs. limit=15.0 2023-06-27 09:28:34,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1775532.0, ans=0.2 2023-06-27 09:28:43,115 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.41 vs. 
limit=15.0 2023-06-27 09:28:47,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1775532.0, ans=0.125 2023-06-27 09:29:12,491 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.429e+02 6.380e+02 8.632e+02 1.324e+03 3.087e+03, threshold=1.726e+03, percent-clipped=6.0 2023-06-27 09:29:39,670 INFO [train.py:996] (2/4) Epoch 10, batch 21500, loss[loss=0.2006, simple_loss=0.2634, pruned_loss=0.06891, over 21557.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2845, pruned_loss=0.06789, over 4276609.17 frames. ], batch size: 391, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:30:07,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1775772.0, ans=0.0 2023-06-27 09:30:32,755 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.87 vs. limit=22.5 2023-06-27 09:30:54,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1775952.0, ans=0.125 2023-06-27 09:31:25,403 INFO [train.py:996] (2/4) Epoch 10, batch 21550, loss[loss=0.1819, simple_loss=0.2526, pruned_loss=0.05562, over 21321.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2776, pruned_loss=0.06535, over 4263224.22 frames. ], batch size: 131, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:32:07,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1776132.0, ans=0.125 2023-06-27 09:32:11,947 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.70 vs. limit=15.0 2023-06-27 09:32:21,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1776192.0, ans=0.1 2023-06-27 09:32:42,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1776192.0, ans=0.125 2023-06-27 09:32:44,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1776252.0, ans=0.125 2023-06-27 09:32:45,486 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.045e+02 5.690e+02 8.515e+02 1.276e+03 3.905e+03, threshold=1.703e+03, percent-clipped=13.0 2023-06-27 09:32:48,485 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=22.5 2023-06-27 09:32:49,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1776252.0, ans=0.2 2023-06-27 09:33:06,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1776252.0, ans=0.125 2023-06-27 09:33:20,035 INFO [train.py:996] (2/4) Epoch 10, batch 21600, loss[loss=0.2095, simple_loss=0.305, pruned_loss=0.05706, over 21619.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2731, pruned_loss=0.0636, over 4269783.72 frames. 
], batch size: 414, lr: 2.90e-03, grad_scale: 32.0 2023-06-27 09:33:36,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1776372.0, ans=0.125 2023-06-27 09:33:55,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1776432.0, ans=0.125 2023-06-27 09:35:03,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1776552.0, ans=0.125 2023-06-27 09:35:06,703 INFO [train.py:996] (2/4) Epoch 10, batch 21650, loss[loss=0.1975, simple_loss=0.274, pruned_loss=0.06048, over 21802.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.276, pruned_loss=0.06243, over 4269804.06 frames. ], batch size: 102, lr: 2.90e-03, grad_scale: 32.0 2023-06-27 09:36:26,713 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.628e+02 5.833e+02 8.995e+02 1.569e+03 2.622e+03, threshold=1.799e+03, percent-clipped=22.0 2023-06-27 09:36:28,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1776852.0, ans=0.2 2023-06-27 09:36:42,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1776852.0, ans=0.1 2023-06-27 09:36:53,196 INFO [train.py:996] (2/4) Epoch 10, batch 21700, loss[loss=0.1842, simple_loss=0.2589, pruned_loss=0.05477, over 21864.00 frames. ], tot_loss[loss=0.199, simple_loss=0.276, pruned_loss=0.06095, over 4270813.55 frames. ], batch size: 107, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:36:54,222 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=15.0 2023-06-27 09:37:17,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1776972.0, ans=0.035 2023-06-27 09:37:44,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1777092.0, ans=0.125 2023-06-27 09:38:04,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1777152.0, ans=0.05 2023-06-27 09:38:17,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1777152.0, ans=0.1 2023-06-27 09:38:38,214 INFO [train.py:996] (2/4) Epoch 10, batch 21750, loss[loss=0.1815, simple_loss=0.2504, pruned_loss=0.05632, over 21777.00 frames. ], tot_loss[loss=0.1977, simple_loss=0.2732, pruned_loss=0.06112, over 4258269.10 frames. ], batch size: 317, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:39:10,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1777332.0, ans=0.2 2023-06-27 09:39:24,740 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.93 vs. 
limit=15.0 2023-06-27 09:39:26,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1777332.0, ans=0.125 2023-06-27 09:39:36,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1777392.0, ans=0.0 2023-06-27 09:39:36,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1777392.0, ans=0.2 2023-06-27 09:39:58,334 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.215e+02 5.988e+02 7.907e+02 1.038e+03 1.862e+03, threshold=1.581e+03, percent-clipped=2.0 2023-06-27 09:40:03,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1777452.0, ans=10.0 2023-06-27 09:40:24,248 INFO [train.py:996] (2/4) Epoch 10, batch 21800, loss[loss=0.237, simple_loss=0.3073, pruned_loss=0.08336, over 21416.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.2719, pruned_loss=0.06216, over 4267361.94 frames. ], batch size: 473, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:40:24,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1777512.0, ans=0.0 2023-06-27 09:40:48,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1777572.0, ans=0.125 2023-06-27 09:41:14,870 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-27 09:42:02,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1777752.0, ans=0.0 2023-06-27 09:42:10,440 INFO [train.py:996] (2/4) Epoch 10, batch 21850, loss[loss=0.1882, simple_loss=0.2512, pruned_loss=0.06265, over 21187.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2786, pruned_loss=0.06322, over 4253532.15 frames. ], batch size: 176, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:42:25,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1777812.0, ans=0.0 2023-06-27 09:42:34,168 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-27 09:43:09,450 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.12 vs. limit=12.0 2023-06-27 09:43:30,246 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.011e+02 6.825e+02 1.374e+03 1.718e+03 3.521e+03, threshold=2.747e+03, percent-clipped=39.0 2023-06-27 09:43:30,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1778052.0, ans=0.125 2023-06-27 09:43:32,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1778052.0, ans=0.125 2023-06-27 09:43:39,324 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=8.60 vs. 
limit=15.0 2023-06-27 09:43:52,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1778052.0, ans=0.95 2023-06-27 09:43:54,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1778112.0, ans=0.0 2023-06-27 09:43:55,454 INFO [train.py:996] (2/4) Epoch 10, batch 21900, loss[loss=0.183, simple_loss=0.2465, pruned_loss=0.05974, over 21478.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2775, pruned_loss=0.06457, over 4264577.77 frames. ], batch size: 212, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:44:04,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1778112.0, ans=0.125 2023-06-27 09:44:08,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1778112.0, ans=0.0 2023-06-27 09:44:18,277 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.57 vs. limit=12.0 2023-06-27 09:44:54,858 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 09:45:40,214 INFO [train.py:996] (2/4) Epoch 10, batch 21950, loss[loss=0.1827, simple_loss=0.2508, pruned_loss=0.05726, over 21814.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2728, pruned_loss=0.0638, over 4266169.26 frames. ], batch size: 118, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:46:09,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1778472.0, ans=0.0 2023-06-27 09:47:06,227 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.644e+02 5.395e+02 6.536e+02 9.589e+02 2.193e+03, threshold=1.307e+03, percent-clipped=0.0 2023-06-27 09:47:22,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1778652.0, ans=0.125 2023-06-27 09:47:26,912 INFO [train.py:996] (2/4) Epoch 10, batch 22000, loss[loss=0.1487, simple_loss=0.2226, pruned_loss=0.03733, over 21617.00 frames. ], tot_loss[loss=0.1943, simple_loss=0.2672, pruned_loss=0.06071, over 4255875.79 frames. ], batch size: 263, lr: 2.90e-03, grad_scale: 32.0 2023-06-27 09:47:33,923 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.98 vs. limit=15.0 2023-06-27 09:48:08,965 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.49 vs. 
limit=15.0 2023-06-27 09:48:17,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1778832.0, ans=0.125 2023-06-27 09:48:37,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1778892.0, ans=0.125 2023-06-27 09:48:47,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1778892.0, ans=0.1 2023-06-27 09:49:08,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1778952.0, ans=0.125 2023-06-27 09:49:08,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1778952.0, ans=0.125 2023-06-27 09:49:18,757 INFO [train.py:996] (2/4) Epoch 10, batch 22050, loss[loss=0.1568, simple_loss=0.2305, pruned_loss=0.04156, over 20874.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2728, pruned_loss=0.06185, over 4246898.47 frames. ], batch size: 609, lr: 2.90e-03, grad_scale: 32.0 2023-06-27 09:49:56,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1779132.0, ans=0.125 2023-06-27 09:50:05,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1779132.0, ans=0.0 2023-06-27 09:50:13,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1779132.0, ans=0.125 2023-06-27 09:50:36,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1779192.0, ans=0.125 2023-06-27 09:50:51,719 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.283e+02 7.142e+02 9.080e+02 1.741e+03 3.538e+03, threshold=1.816e+03, percent-clipped=36.0 2023-06-27 09:51:05,041 INFO [train.py:996] (2/4) Epoch 10, batch 22100, loss[loss=0.2164, simple_loss=0.2885, pruned_loss=0.0721, over 21963.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2816, pruned_loss=0.06585, over 4253404.74 frames. ], batch size: 316, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:52:51,252 INFO [train.py:996] (2/4) Epoch 10, batch 22150, loss[loss=0.1896, simple_loss=0.2561, pruned_loss=0.06158, over 21275.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2856, pruned_loss=0.06739, over 4263055.26 frames. ], batch size: 176, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:52:54,282 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.58 vs. 
limit=15.0 2023-06-27 09:52:59,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1779612.0, ans=0.0 2023-06-27 09:53:02,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1779612.0, ans=0.0 2023-06-27 09:54:25,382 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.248e+02 5.661e+02 7.475e+02 1.093e+03 2.487e+03, threshold=1.495e+03, percent-clipped=9.0 2023-06-27 09:54:29,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1779852.0, ans=0.125 2023-06-27 09:54:39,172 INFO [train.py:996] (2/4) Epoch 10, batch 22200, loss[loss=0.2484, simple_loss=0.3351, pruned_loss=0.08089, over 21727.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2892, pruned_loss=0.06868, over 4272705.83 frames. ], batch size: 441, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:55:13,711 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-27 09:55:32,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1780032.0, ans=0.125 2023-06-27 09:56:27,595 INFO [train.py:996] (2/4) Epoch 10, batch 22250, loss[loss=0.2625, simple_loss=0.3381, pruned_loss=0.09345, over 21552.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2963, pruned_loss=0.07063, over 4269544.48 frames. ], batch size: 414, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:56:54,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1780272.0, ans=0.125 2023-06-27 09:57:01,828 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-27 09:57:58,532 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.450e+02 6.092e+02 1.077e+03 1.479e+03 2.486e+03, threshold=2.154e+03, percent-clipped=24.0 2023-06-27 09:58:12,289 INFO [train.py:996] (2/4) Epoch 10, batch 22300, loss[loss=0.2019, simple_loss=0.2671, pruned_loss=0.06834, over 21846.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.297, pruned_loss=0.07167, over 4274963.39 frames. ], batch size: 247, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:58:24,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1780512.0, ans=0.125 2023-06-27 09:59:06,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1780632.0, ans=0.1 2023-06-27 09:59:58,435 INFO [train.py:996] (2/4) Epoch 10, batch 22350, loss[loss=0.2215, simple_loss=0.2826, pruned_loss=0.08014, over 21570.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2957, pruned_loss=0.07277, over 4285907.54 frames. ], batch size: 548, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:01:32,057 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.877e+02 5.424e+02 7.099e+02 9.642e+02 1.783e+03, threshold=1.420e+03, percent-clipped=0.0 2023-06-27 10:01:45,474 INFO [train.py:996] (2/4) Epoch 10, batch 22400, loss[loss=0.2241, simple_loss=0.2871, pruned_loss=0.08061, over 21500.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.292, pruned_loss=0.06956, over 4289042.65 frames. 
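
Annotation: each "loss[...]" / "tot_loss[...]" entry above reports three quantities. The logged numbers are consistent with the combined loss being a fixed weighted sum of the two transducer losses, loss ≈ 0.5 * simple_loss + pruned_loss; for instance, batch 22200 gives 0.5 * 0.2892 + 0.06868 ≈ 0.2133 and batch 22400 gives 0.5 * 0.292 + 0.06956 ≈ 0.2155. A minimal sketch of that combination (the function name and the 0.5 scale are inferred from the logged values, not copied from the training code):

    def combine_transducer_losses(simple_loss, pruned_loss, simple_loss_scale=0.5):
        # Weighted sum that reproduces the logged totals, e.g.
        # 0.5 * 0.2892 + 0.06868 ~= 0.2133 for batch 22200 above.
        return simple_loss_scale * simple_loss + pruned_loss

    assert abs(combine_transducer_losses(0.2892, 0.06868) - 0.2133) < 1e-3
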
], batch size: 441, lr: 2.90e-03, grad_scale: 32.0 2023-06-27 10:03:02,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1781292.0, ans=0.125 2023-06-27 10:03:30,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1781412.0, ans=0.2 2023-06-27 10:03:31,239 INFO [train.py:996] (2/4) Epoch 10, batch 22450, loss[loss=0.1816, simple_loss=0.2452, pruned_loss=0.05898, over 21610.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2859, pruned_loss=0.06865, over 4282496.01 frames. ], batch size: 231, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:04:29,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1781532.0, ans=0.0 2023-06-27 10:04:33,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1781532.0, ans=0.125 2023-06-27 10:05:07,720 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.016e+02 6.073e+02 8.479e+02 1.179e+03 3.261e+03, threshold=1.696e+03, percent-clipped=18.0 2023-06-27 10:05:15,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1781652.0, ans=0.2 2023-06-27 10:05:18,409 INFO [train.py:996] (2/4) Epoch 10, batch 22500, loss[loss=0.2052, simple_loss=0.2922, pruned_loss=0.0591, over 21625.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2807, pruned_loss=0.06706, over 4287704.99 frames. ], batch size: 263, lr: 2.90e-03, grad_scale: 8.0 2023-06-27 10:05:33,621 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.98 vs. limit=22.5 2023-06-27 10:06:41,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1781892.0, ans=0.125 2023-06-27 10:07:06,844 INFO [train.py:996] (2/4) Epoch 10, batch 22550, loss[loss=0.1988, simple_loss=0.2775, pruned_loss=0.06008, over 21931.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.283, pruned_loss=0.06721, over 4284961.83 frames. ], batch size: 316, lr: 2.90e-03, grad_scale: 8.0 2023-06-27 10:08:41,507 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.419e+02 6.178e+02 1.242e+03 1.950e+03 4.739e+03, threshold=2.485e+03, percent-clipped=29.0 2023-06-27 10:08:51,924 INFO [train.py:996] (2/4) Epoch 10, batch 22600, loss[loss=0.2228, simple_loss=0.3082, pruned_loss=0.06871, over 21748.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2869, pruned_loss=0.06797, over 4285539.42 frames. ], batch size: 351, lr: 2.90e-03, grad_scale: 8.0 2023-06-27 10:08:52,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=1782312.0, ans=0.02 2023-06-27 10:09:01,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1782312.0, ans=0.0 2023-06-27 10:09:23,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1782372.0, ans=0.125 2023-06-27 10:09:48,449 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.66 vs. 
limit=15.0 2023-06-27 10:10:05,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1782492.0, ans=0.0 2023-06-27 10:10:38,556 INFO [train.py:996] (2/4) Epoch 10, batch 22650, loss[loss=0.1834, simple_loss=0.2542, pruned_loss=0.05626, over 21625.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2837, pruned_loss=0.06755, over 4270228.25 frames. ], batch size: 298, lr: 2.90e-03, grad_scale: 8.0 2023-06-27 10:10:55,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1782612.0, ans=0.125 2023-06-27 10:11:39,369 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 10:11:45,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1782792.0, ans=0.0 2023-06-27 10:12:00,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1782792.0, ans=0.5 2023-06-27 10:12:16,227 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.186e+02 6.070e+02 1.000e+03 1.313e+03 3.118e+03, threshold=2.001e+03, percent-clipped=3.0 2023-06-27 10:12:26,410 INFO [train.py:996] (2/4) Epoch 10, batch 22700, loss[loss=0.1911, simple_loss=0.2527, pruned_loss=0.06471, over 21639.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2772, pruned_loss=0.06647, over 4274196.99 frames. ], batch size: 333, lr: 2.90e-03, grad_scale: 8.0 2023-06-27 10:12:35,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1782912.0, ans=10.0 2023-06-27 10:12:47,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1782912.0, ans=0.125 2023-06-27 10:14:12,792 INFO [train.py:996] (2/4) Epoch 10, batch 22750, loss[loss=0.2415, simple_loss=0.355, pruned_loss=0.06399, over 19673.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2812, pruned_loss=0.06871, over 4274252.44 frames. 
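
Annotation: the grad_scale value printed with each batch summary (32.0, 16.0, 8.0 above) moves up and down the way a dynamic loss scaler behaves under mixed-precision training: it is reduced when an overflow is detected and grown back after a run of clean steps, which is why it changes between batches. A generic sketch of that pattern with the standard PyTorch AMP API (model, optimizer and loss_fn are placeholders, not the actual training loop):

    import torch

    scaler = torch.cuda.amp.GradScaler()  # owns the dynamic grad_scale

    def train_step(model, optimizer, features, targets, loss_fn):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = loss_fn(model(features), targets)
        scaler.scale(loss).backward()  # backward on the scaled loss
        scaler.step(optimizer)         # unscales grads; skips the step on overflow
        scaler.update()                # shrinks the scale on overflow, grows it otherwise
        return loss.detach()
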
], batch size: 703, lr: 2.90e-03, grad_scale: 8.0 2023-06-27 10:14:26,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1783212.0, ans=0.125 2023-06-27 10:14:43,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1783212.0, ans=0.125 2023-06-27 10:14:53,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1783272.0, ans=0.125 2023-06-27 10:14:58,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1783272.0, ans=0.125 2023-06-27 10:15:06,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1783332.0, ans=0.0 2023-06-27 10:15:15,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1783332.0, ans=15.0 2023-06-27 10:15:27,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1783392.0, ans=0.2 2023-06-27 10:15:48,223 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.913e+02 6.203e+02 1.029e+03 1.531e+03 3.011e+03, threshold=2.057e+03, percent-clipped=6.0 2023-06-27 10:16:04,123 INFO [train.py:996] (2/4) Epoch 10, batch 22800, loss[loss=0.2239, simple_loss=0.2943, pruned_loss=0.07677, over 21732.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2839, pruned_loss=0.07079, over 4278687.09 frames. ], batch size: 389, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:16:18,156 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 10:16:19,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1783512.0, ans=0.0 2023-06-27 10:17:17,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1783692.0, ans=0.95 2023-06-27 10:17:44,766 INFO [train.py:996] (2/4) Epoch 10, batch 22850, loss[loss=0.1973, simple_loss=0.2562, pruned_loss=0.06921, over 21257.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2812, pruned_loss=0.07046, over 4274936.03 frames. ], batch size: 176, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:18:20,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1783872.0, ans=0.125 2023-06-27 10:18:44,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1783932.0, ans=0.125 2023-06-27 10:19:02,060 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=15.0 2023-06-27 10:19:23,326 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.547e+02 9.815e+02 1.470e+03 2.221e+03 4.175e+03, threshold=2.939e+03, percent-clipped=31.0 2023-06-27 10:19:43,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1784112.0, ans=0.125 2023-06-27 10:19:44,587 INFO [train.py:996] (2/4) Epoch 10, batch 22900, loss[loss=0.2374, simple_loss=0.3458, pruned_loss=0.06456, over 21652.00 frames. 
], tot_loss[loss=0.2111, simple_loss=0.2833, pruned_loss=0.06944, over 4274402.37 frames. ], batch size: 414, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:20:26,277 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.30 vs. limit=15.0 2023-06-27 10:21:31,357 INFO [train.py:996] (2/4) Epoch 10, batch 22950, loss[loss=0.2597, simple_loss=0.3877, pruned_loss=0.06587, over 21591.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2964, pruned_loss=0.06845, over 4276101.24 frames. ], batch size: 389, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:21:39,257 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=15.0 2023-06-27 10:22:10,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1784532.0, ans=0.0 2023-06-27 10:22:12,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1784532.0, ans=0.125 2023-06-27 10:22:17,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1784532.0, ans=0.95 2023-06-27 10:22:47,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1784652.0, ans=0.125 2023-06-27 10:22:50,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1784652.0, ans=0.5 2023-06-27 10:22:51,598 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.863e+02 5.875e+02 8.793e+02 1.271e+03 3.173e+03, threshold=1.759e+03, percent-clipped=4.0 2023-06-27 10:23:05,431 INFO [train.py:996] (2/4) Epoch 10, batch 23000, loss[loss=0.2398, simple_loss=0.3076, pruned_loss=0.086, over 21828.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2979, pruned_loss=0.06664, over 4281214.36 frames. ], batch size: 414, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:23:09,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1784712.0, ans=0.2 2023-06-27 10:23:33,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1784772.0, ans=0.125 2023-06-27 10:23:37,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1784832.0, ans=0.04949747468305833 2023-06-27 10:23:54,256 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-27 10:24:09,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1784892.0, ans=10.0 2023-06-27 10:24:13,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1784892.0, ans=0.2 2023-06-27 10:24:32,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1784952.0, ans=0.1 2023-06-27 10:24:46,696 INFO [train.py:996] (2/4) Epoch 10, batch 23050, loss[loss=0.2552, simple_loss=0.3276, pruned_loss=0.09139, over 21588.00 frames. 
], tot_loss[loss=0.2175, simple_loss=0.2983, pruned_loss=0.06834, over 4281707.80 frames. ], batch size: 389, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:24:50,999 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=22.5 2023-06-27 10:24:57,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1785012.0, ans=0.0 2023-06-27 10:25:02,720 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.75 vs. limit=15.0 2023-06-27 10:25:33,141 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-27 10:25:37,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1785132.0, ans=0.0 2023-06-27 10:26:17,102 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.088e+02 5.488e+02 7.273e+02 1.121e+03 2.826e+03, threshold=1.455e+03, percent-clipped=6.0 2023-06-27 10:26:17,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1785252.0, ans=0.0 2023-06-27 10:26:26,892 INFO [train.py:996] (2/4) Epoch 10, batch 23100, loss[loss=0.1911, simple_loss=0.2618, pruned_loss=0.06026, over 21379.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2936, pruned_loss=0.06829, over 4273119.42 frames. ], batch size: 131, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:26:31,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1785312.0, ans=0.125 2023-06-27 10:26:35,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1785312.0, ans=0.0 2023-06-27 10:27:52,967 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 10:28:02,061 INFO [train.py:996] (2/4) Epoch 10, batch 23150, loss[loss=0.2432, simple_loss=0.3032, pruned_loss=0.09154, over 21249.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2875, pruned_loss=0.06793, over 4281092.35 frames. ], batch size: 159, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:28:50,789 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.42 vs. limit=22.5 2023-06-27 10:28:59,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1785792.0, ans=0.125 2023-06-27 10:29:25,878 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.152e+02 5.952e+02 7.532e+02 1.121e+03 2.900e+03, threshold=1.506e+03, percent-clipped=14.0 2023-06-27 10:29:28,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1785852.0, ans=0.125 2023-06-27 10:29:35,448 INFO [train.py:996] (2/4) Epoch 10, batch 23200, loss[loss=0.1999, simple_loss=0.2684, pruned_loss=0.06568, over 21663.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.287, pruned_loss=0.06888, over 4281366.37 frames. 
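
Annotation: the recurring "Clipping_scale=2.0, grad-norm quartiles a b c d e, threshold=t, percent-clipped=p" lines summarise the distribution of recent gradient norms (min / 25% / 50% / 75% / max). In every instance above, the reported threshold equals clipping_scale times the median, e.g. 2.0 * 7.273e+02 ≈ 1.455e+03, so clipping is applied relative to a running median norm, and percent-clipped is the share of recent batches whose norm exceeded it. A small illustrative bookkeeping class (history length and all names here are assumptions, not the optimizer's actual code):

    import torch

    class MedianGradClipper:
        def __init__(self, clipping_scale=2.0, history=128):
            self.clipping_scale = clipping_scale
            self.history = history
            self.norms = []     # recent total grad norms
            self.clipped = 0
            self.seen = 0

        def __call__(self, parameters):
            params = [p for p in parameters if p.grad is not None]
            norm = torch.norm(torch.stack([p.grad.norm() for p in params])).item()
            self.norms = (self.norms + [norm])[-self.history:]
            q = torch.quantile(torch.tensor(self.norms),
                               torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
            threshold = self.clipping_scale * q[2].item()   # scale * median
            self.seen += 1
            if norm > threshold:
                self.clipped += 1
                for p in params:
                    p.grad.mul_(threshold / norm)
            return q, threshold, 100.0 * self.clipped / self.seen
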
], batch size: 263, lr: 2.90e-03, grad_scale: 32.0 2023-06-27 10:29:36,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1785912.0, ans=0.2 2023-06-27 10:30:05,805 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.94 vs. limit=12.0 2023-06-27 10:31:01,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1786152.0, ans=0.125 2023-06-27 10:31:10,865 INFO [train.py:996] (2/4) Epoch 10, batch 23250, loss[loss=0.2072, simple_loss=0.2698, pruned_loss=0.07223, over 21668.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2864, pruned_loss=0.06971, over 4288061.28 frames. ], batch size: 263, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:31:23,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1786212.0, ans=0.2 2023-06-27 10:31:40,349 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=15.0 2023-06-27 10:31:52,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1786332.0, ans=0.125 2023-06-27 10:32:44,614 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.481e+02 7.308e+02 1.025e+03 1.554e+03 3.146e+03, threshold=2.050e+03, percent-clipped=26.0 2023-06-27 10:32:52,914 INFO [train.py:996] (2/4) Epoch 10, batch 23300, loss[loss=0.3003, simple_loss=0.4011, pruned_loss=0.0998, over 21694.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2956, pruned_loss=0.07214, over 4291014.92 frames. ], batch size: 441, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:34:02,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1786692.0, ans=0.07 2023-06-27 10:34:04,064 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.49 vs. limit=15.0 2023-06-27 10:34:26,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1786752.0, ans=0.0 2023-06-27 10:34:33,877 INFO [train.py:996] (2/4) Epoch 10, batch 23350, loss[loss=0.187, simple_loss=0.2739, pruned_loss=0.05002, over 21672.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2995, pruned_loss=0.07138, over 4287740.94 frames. ], batch size: 414, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:36:05,515 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.446e+02 7.072e+02 1.049e+03 1.355e+03 2.858e+03, threshold=2.098e+03, percent-clipped=8.0 2023-06-27 10:36:13,600 INFO [train.py:996] (2/4) Epoch 10, batch 23400, loss[loss=0.2202, simple_loss=0.2967, pruned_loss=0.07183, over 21735.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2925, pruned_loss=0.06776, over 4284527.24 frames. 
], batch size: 112, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:37:25,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1787292.0, ans=0.04949747468305833 2023-06-27 10:37:31,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1787292.0, ans=0.125 2023-06-27 10:37:40,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1787352.0, ans=0.125 2023-06-27 10:37:54,848 INFO [train.py:996] (2/4) Epoch 10, batch 23450, loss[loss=0.2412, simple_loss=0.3089, pruned_loss=0.08673, over 21742.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2921, pruned_loss=0.06954, over 4284832.25 frames. ], batch size: 351, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:38:00,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1787412.0, ans=0.1 2023-06-27 10:38:52,754 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=15.0 2023-06-27 10:39:00,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1787592.0, ans=0.0 2023-06-27 10:39:25,537 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.955e+02 6.626e+02 1.004e+03 1.261e+03 2.377e+03, threshold=2.009e+03, percent-clipped=2.0 2023-06-27 10:39:38,049 INFO [train.py:996] (2/4) Epoch 10, batch 23500, loss[loss=0.2016, simple_loss=0.2628, pruned_loss=0.0702, over 21619.00 frames. ], tot_loss[loss=0.218, simple_loss=0.293, pruned_loss=0.07154, over 4285997.59 frames. ], batch size: 548, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:39:50,373 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.65 vs. limit=22.5 2023-06-27 10:40:57,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1787952.0, ans=0.0 2023-06-27 10:41:00,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1787952.0, ans=0.1 2023-06-27 10:41:01,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1787952.0, ans=0.0 2023-06-27 10:41:14,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1787952.0, ans=0.1 2023-06-27 10:41:17,178 INFO [train.py:996] (2/4) Epoch 10, batch 23550, loss[loss=0.1972, simple_loss=0.2678, pruned_loss=0.0633, over 21798.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2888, pruned_loss=0.07126, over 4280423.63 frames. ], batch size: 118, lr: 2.89e-03, grad_scale: 8.0 2023-06-27 10:42:41,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1788252.0, ans=0.0 2023-06-27 10:42:47,397 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.468e+02 6.896e+02 9.643e+02 1.434e+03 2.789e+03, threshold=1.929e+03, percent-clipped=7.0 2023-06-27 10:42:58,648 INFO [train.py:996] (2/4) Epoch 10, batch 23600, loss[loss=0.2945, simple_loss=0.3505, pruned_loss=0.1192, over 21326.00 frames. 
], tot_loss[loss=0.2158, simple_loss=0.2893, pruned_loss=0.07118, over 4276366.77 frames. ], batch size: 508, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:43:17,656 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.40 vs. limit=15.0 2023-06-27 10:43:45,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1788372.0, ans=0.0 2023-06-27 10:44:14,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1788492.0, ans=0.125 2023-06-27 10:44:46,995 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=22.5 2023-06-27 10:44:47,508 INFO [train.py:996] (2/4) Epoch 10, batch 23650, loss[loss=0.2161, simple_loss=0.3008, pruned_loss=0.06572, over 21903.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2893, pruned_loss=0.06927, over 4282830.19 frames. ], batch size: 316, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:44:57,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1788612.0, ans=0.0 2023-06-27 10:45:24,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1788672.0, ans=0.1 2023-06-27 10:45:47,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1788792.0, ans=0.1 2023-06-27 10:46:27,253 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.326e+02 5.698e+02 8.154e+02 1.096e+03 2.339e+03, threshold=1.631e+03, percent-clipped=3.0 2023-06-27 10:46:38,535 INFO [train.py:996] (2/4) Epoch 10, batch 23700, loss[loss=0.2187, simple_loss=0.3032, pruned_loss=0.0671, over 21304.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2925, pruned_loss=0.06911, over 4281346.67 frames. ], batch size: 549, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:46:49,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1788912.0, ans=0.05 2023-06-27 10:46:50,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1788912.0, ans=0.2 2023-06-27 10:46:50,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1788912.0, ans=0.0 2023-06-27 10:46:51,211 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=12.0 2023-06-27 10:47:20,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1789032.0, ans=0.0 2023-06-27 10:47:22,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1789032.0, ans=0.0 2023-06-27 10:48:19,723 INFO [train.py:996] (2/4) Epoch 10, batch 23750, loss[loss=0.2078, simple_loss=0.3075, pruned_loss=0.05402, over 21581.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2939, pruned_loss=0.0692, over 4275996.09 frames. 
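
Annotation: the frequent "ScheduledFloat: name=..., batch_count=..., ans=..." lines record the current value of hyper-parameters (skip rates, dropout probabilities, balancer probabilities, bypass scales) that are scheduled against the global batch count rather than held constant; "ans" is the value in effect at that batch_count. A piecewise-linear schedule of this kind can be sketched as below (the breakpoints in the example are invented for illustration):

    def scheduled_float(batch_count, points):
        """Piecewise-linear interpolation over (batch_count, value) breakpoints."""
        points = sorted(points)
        if batch_count <= points[0][0]:
            return points[0][1]
        if batch_count >= points[-1][0]:
            return points[-1][1]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if x0 <= batch_count <= x1:
                return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)

    # A dropout-like value that decays from 0.3 to 0.1 over the first 20k batches
    # and stays at 0.1 afterwards (values chosen only for the example):
    print(scheduled_float(1_788_372.0, [(0, 0.3), (20_000, 0.1)]))  # -> 0.1
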
], batch size: 414, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:49:00,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1789332.0, ans=0.04949747468305833 2023-06-27 10:49:18,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1789392.0, ans=0.1 2023-06-27 10:49:23,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1789392.0, ans=0.1 2023-06-27 10:49:51,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1789452.0, ans=0.125 2023-06-27 10:49:55,488 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.670e+02 6.081e+02 7.830e+02 1.141e+03 2.559e+03, threshold=1.566e+03, percent-clipped=8.0 2023-06-27 10:50:02,087 INFO [train.py:996] (2/4) Epoch 10, batch 23800, loss[loss=0.2629, simple_loss=0.3815, pruned_loss=0.07214, over 19927.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2921, pruned_loss=0.06738, over 4277281.92 frames. ], batch size: 702, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:50:07,152 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.84 vs. limit=15.0 2023-06-27 10:50:49,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1789632.0, ans=0.95 2023-06-27 10:51:45,230 INFO [train.py:996] (2/4) Epoch 10, batch 23850, loss[loss=0.237, simple_loss=0.3153, pruned_loss=0.07935, over 21594.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2998, pruned_loss=0.06902, over 4274195.96 frames. ], batch size: 389, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:52:06,441 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.55 vs. limit=22.5 2023-06-27 10:52:37,816 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-27 10:52:54,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1789992.0, ans=0.1 2023-06-27 10:53:06,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1789992.0, ans=0.125 2023-06-27 10:53:18,318 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.153e+02 6.860e+02 1.142e+03 1.790e+03 3.579e+03, threshold=2.285e+03, percent-clipped=29.0 2023-06-27 10:53:24,735 INFO [train.py:996] (2/4) Epoch 10, batch 23900, loss[loss=0.2069, simple_loss=0.2664, pruned_loss=0.0737, over 20264.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3031, pruned_loss=0.07, over 4275469.43 frames. 
], batch size: 703, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:53:44,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1790172.0, ans=0.0 2023-06-27 10:54:24,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1790232.0, ans=0.0 2023-06-27 10:54:58,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1790352.0, ans=0.0 2023-06-27 10:55:05,883 INFO [train.py:996] (2/4) Epoch 10, batch 23950, loss[loss=0.165, simple_loss=0.2346, pruned_loss=0.04768, over 21589.00 frames. ], tot_loss[loss=0.221, simple_loss=0.3003, pruned_loss=0.07082, over 4263932.62 frames. ], batch size: 231, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:55:09,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1790412.0, ans=0.1 2023-06-27 10:55:18,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1790412.0, ans=0.125 2023-06-27 10:56:40,582 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.159e+02 7.309e+02 9.584e+02 1.406e+03 2.703e+03, threshold=1.917e+03, percent-clipped=3.0 2023-06-27 10:56:47,121 INFO [train.py:996] (2/4) Epoch 10, batch 24000, loss[loss=0.218, simple_loss=0.2933, pruned_loss=0.07129, over 21813.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3014, pruned_loss=0.07329, over 4269701.53 frames. ], batch size: 282, lr: 2.89e-03, grad_scale: 32.0 2023-06-27 10:56:47,121 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-27 10:57:07,144 INFO [train.py:1028] (2/4) Epoch 10, validation: loss=0.2621, simple_loss=0.3549, pruned_loss=0.08461, over 1796401.00 frames. 2023-06-27 10:57:07,145 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-27 10:57:30,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1790772.0, ans=0.125 2023-06-27 10:57:47,239 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.18 vs. limit=15.0 2023-06-27 10:58:10,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1790892.0, ans=0.2 2023-06-27 10:58:27,053 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.18 vs. limit=15.0 2023-06-27 10:58:45,463 INFO [train.py:996] (2/4) Epoch 10, batch 24050, loss[loss=0.2111, simple_loss=0.3016, pruned_loss=0.06032, over 21748.00 frames. ], tot_loss[loss=0.226, simple_loss=0.304, pruned_loss=0.07405, over 4269856.01 frames. ], batch size: 332, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:59:27,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1791132.0, ans=0.0 2023-06-27 10:59:36,008 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.59 vs. 
limit=12.0 2023-06-27 10:59:40,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1791192.0, ans=0.125 2023-06-27 11:00:21,996 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.514e+02 5.749e+02 8.023e+02 1.325e+03 2.806e+03, threshold=1.605e+03, percent-clipped=11.0 2023-06-27 11:00:32,047 INFO [train.py:996] (2/4) Epoch 10, batch 24100, loss[loss=0.2155, simple_loss=0.3071, pruned_loss=0.06193, over 21693.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3034, pruned_loss=0.0722, over 4269926.12 frames. ], batch size: 298, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:01:03,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1791432.0, ans=0.125 2023-06-27 11:01:04,327 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.57 vs. limit=6.0 2023-06-27 11:01:05,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1791432.0, ans=0.125 2023-06-27 11:01:11,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1791432.0, ans=0.2 2023-06-27 11:01:24,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1791492.0, ans=0.1 2023-06-27 11:01:40,507 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.47 vs. limit=10.0 2023-06-27 11:02:13,404 INFO [train.py:996] (2/4) Epoch 10, batch 24150, loss[loss=0.2187, simple_loss=0.2926, pruned_loss=0.07242, over 21323.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3033, pruned_loss=0.07337, over 4272421.71 frames. ], batch size: 159, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:02:46,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1791732.0, ans=0.2 2023-06-27 11:02:56,995 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=12.0 2023-06-27 11:03:16,426 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0 2023-06-27 11:03:45,388 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.503e+02 6.384e+02 9.147e+02 1.297e+03 2.622e+03, threshold=1.829e+03, percent-clipped=12.0 2023-06-27 11:03:50,487 INFO [train.py:996] (2/4) Epoch 10, batch 24200, loss[loss=0.2474, simple_loss=0.3271, pruned_loss=0.08391, over 21701.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3055, pruned_loss=0.07455, over 4271831.47 frames. ], batch size: 351, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:04:04,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1791912.0, ans=0.125 2023-06-27 11:04:16,560 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-06-27 11:05:33,112 INFO [train.py:996] (2/4) Epoch 10, batch 24250, loss[loss=0.2328, simple_loss=0.3455, pruned_loss=0.06001, over 19902.00 frames. 
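
Annotation: at batch 24000 above, training pauses to compute a validation loss over a fixed dev set (the "Computing validation loss" / "validation: loss=0.2621 ..." lines) and then reports the peak CUDA memory allocated so far. A generic sketch of such a validation pass (the dataloader and loss function are placeholders, not the project's actual code):

    import torch

    @torch.no_grad()
    def compute_validation_loss(model, valid_loader, loss_fn, device="cuda"):
        model.eval()
        total_loss, total_frames = 0.0, 0.0
        for features, targets, num_frames in valid_loader:
            loss = loss_fn(model(features.to(device)), targets.to(device))
            total_loss += loss.item() * num_frames
            total_frames += num_frames
        model.train()
        # Mirror the "Maximum memory allocated so far" line in the log:
        max_mem_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
        return total_loss / total_frames, max_mem_mb
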
], tot_loss[loss=0.2201, simple_loss=0.3023, pruned_loss=0.06891, over 4276295.11 frames. ], batch size: 702, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:05:58,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1792272.0, ans=0.125 2023-06-27 11:06:00,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1792272.0, ans=0.1 2023-06-27 11:06:27,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1792332.0, ans=0.5 2023-06-27 11:06:51,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1792392.0, ans=0.0 2023-06-27 11:07:05,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1792452.0, ans=0.125 2023-06-27 11:07:09,243 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.373e+02 5.833e+02 9.026e+02 1.321e+03 2.992e+03, threshold=1.805e+03, percent-clipped=10.0 2023-06-27 11:07:14,013 INFO [train.py:996] (2/4) Epoch 10, batch 24300, loss[loss=0.1722, simple_loss=0.2432, pruned_loss=0.05066, over 21706.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2948, pruned_loss=0.06311, over 4278089.93 frames. ], batch size: 112, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:07:19,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1792512.0, ans=0.0 2023-06-27 11:07:52,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1792632.0, ans=0.2 2023-06-27 11:08:27,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1792692.0, ans=0.0 2023-06-27 11:08:55,411 INFO [train.py:996] (2/4) Epoch 10, batch 24350, loss[loss=0.2094, simple_loss=0.2816, pruned_loss=0.06861, over 21653.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2919, pruned_loss=0.06351, over 4276275.71 frames. ], batch size: 263, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:10:15,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1793052.0, ans=0.125 2023-06-27 11:10:22,222 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=22.5 2023-06-27 11:10:27,624 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.762e+02 6.342e+02 9.950e+02 1.336e+03 3.105e+03, threshold=1.990e+03, percent-clipped=13.0 2023-06-27 11:10:28,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1793052.0, ans=0.125 2023-06-27 11:10:32,377 INFO [train.py:996] (2/4) Epoch 10, batch 24400, loss[loss=0.2559, simple_loss=0.3271, pruned_loss=0.09235, over 21818.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2953, pruned_loss=0.06633, over 4268698.20 frames. 
], batch size: 441, lr: 2.89e-03, grad_scale: 32.0 2023-06-27 11:10:41,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1793112.0, ans=0.125 2023-06-27 11:11:03,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1793172.0, ans=0.125 2023-06-27 11:11:19,134 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.40 vs. limit=15.0 2023-06-27 11:11:21,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1793232.0, ans=0.1 2023-06-27 11:11:38,711 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-27 11:11:49,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1793292.0, ans=0.1 2023-06-27 11:12:01,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1793352.0, ans=0.125 2023-06-27 11:12:14,398 INFO [train.py:996] (2/4) Epoch 10, batch 24450, loss[loss=0.2098, simple_loss=0.2744, pruned_loss=0.07255, over 20101.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2979, pruned_loss=0.06821, over 4268449.75 frames. ], batch size: 707, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:13:12,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1793532.0, ans=0.125 2023-06-27 11:13:46,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1793652.0, ans=0.125 2023-06-27 11:13:50,877 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.345e+02 6.636e+02 9.241e+02 1.233e+03 3.193e+03, threshold=1.848e+03, percent-clipped=3.0 2023-06-27 11:13:51,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1793652.0, ans=0.0 2023-06-27 11:13:54,213 INFO [train.py:996] (2/4) Epoch 10, batch 24500, loss[loss=0.2202, simple_loss=0.2963, pruned_loss=0.07211, over 21715.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2986, pruned_loss=0.06869, over 4271898.87 frames. ], batch size: 389, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:14:01,628 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.69 vs. 
limit=22.5 2023-06-27 11:14:10,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1793712.0, ans=0.0 2023-06-27 11:14:12,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1793712.0, ans=0.0 2023-06-27 11:14:26,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1793772.0, ans=0.125 2023-06-27 11:15:04,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1793892.0, ans=0.0 2023-06-27 11:15:14,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1793892.0, ans=0.125 2023-06-27 11:15:24,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1793952.0, ans=0.025 2023-06-27 11:15:40,102 INFO [train.py:996] (2/4) Epoch 10, batch 24550, loss[loss=0.2376, simple_loss=0.3201, pruned_loss=0.07759, over 21535.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3021, pruned_loss=0.07095, over 4278138.83 frames. ], batch size: 194, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:15:43,794 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 11:16:07,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1794072.0, ans=0.04949747468305833 2023-06-27 11:16:26,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1794132.0, ans=0.2 2023-06-27 11:16:31,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1794132.0, ans=0.0 2023-06-27 11:16:35,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1794132.0, ans=0.125 2023-06-27 11:16:35,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1794132.0, ans=0.2 2023-06-27 11:17:16,637 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.256e+02 6.446e+02 9.214e+02 1.322e+03 3.260e+03, threshold=1.843e+03, percent-clipped=13.0 2023-06-27 11:17:19,812 INFO [train.py:996] (2/4) Epoch 10, batch 24600, loss[loss=0.2134, simple_loss=0.3442, pruned_loss=0.04127, over 19844.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2987, pruned_loss=0.07096, over 4276242.22 frames. ], batch size: 702, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:17:49,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1794372.0, ans=0.0 2023-06-27 11:17:59,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1794372.0, ans=0.1 2023-06-27 11:18:17,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1794492.0, ans=0.125 2023-06-27 11:18:31,507 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.75 vs. limit=15.0 2023-06-27 11:19:06,370 INFO [train.py:996] (2/4) Epoch 10, batch 24650, loss[loss=0.1758, simple_loss=0.2456, pruned_loss=0.053, over 21587.00 frames. 
], tot_loss[loss=0.215, simple_loss=0.2913, pruned_loss=0.0693, over 4274626.26 frames. ], batch size: 263, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:20:05,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1794792.0, ans=0.125 2023-06-27 11:20:15,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1794852.0, ans=15.0 2023-06-27 11:20:38,903 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.116e+02 6.411e+02 8.563e+02 1.154e+03 3.780e+03, threshold=1.713e+03, percent-clipped=12.0 2023-06-27 11:20:42,281 INFO [train.py:996] (2/4) Epoch 10, batch 24700, loss[loss=0.1716, simple_loss=0.2272, pruned_loss=0.058, over 20819.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2881, pruned_loss=0.0681, over 4271329.22 frames. ], batch size: 609, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:22:16,082 INFO [train.py:996] (2/4) Epoch 10, batch 24750, loss[loss=0.1854, simple_loss=0.2581, pruned_loss=0.05635, over 21627.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2817, pruned_loss=0.06593, over 4271142.77 frames. ], batch size: 298, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:22:27,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1795212.0, ans=0.125 2023-06-27 11:22:44,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1795272.0, ans=0.0 2023-06-27 11:22:48,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1795272.0, ans=0.125 2023-06-27 11:22:57,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1795332.0, ans=0.125 2023-06-27 11:23:05,274 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 11:23:35,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1795452.0, ans=0.125 2023-06-27 11:23:37,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1795452.0, ans=0.125 2023-06-27 11:23:38,634 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.117e+02 6.853e+02 9.586e+02 1.478e+03 3.032e+03, threshold=1.917e+03, percent-clipped=13.0 2023-06-27 11:23:46,723 INFO [train.py:996] (2/4) Epoch 10, batch 24800, loss[loss=0.2163, simple_loss=0.2927, pruned_loss=0.06995, over 21846.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2762, pruned_loss=0.0652, over 4278908.45 frames. ], batch size: 124, lr: 2.89e-03, grad_scale: 32.0 2023-06-27 11:23:54,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1795512.0, ans=0.0 2023-06-27 11:24:55,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1795692.0, ans=0.125 2023-06-27 11:25:29,261 INFO [train.py:996] (2/4) Epoch 10, batch 24850, loss[loss=0.2173, simple_loss=0.2973, pruned_loss=0.06863, over 21864.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2776, pruned_loss=0.06692, over 4280737.27 frames. 
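
Annotation: the "Whitening: name=..., metric=X vs. limit=Y" lines are diagnostics comparing how close a module's activations are to having a white (identity-like) covariance; the metric grows as the eigenvalue spread of the feature covariance widens, and the limit is the ceiling it is being compared against. One plausible way to compute such a whiteness measure, shown only for intuition and not taken from scaling.py:

    import torch

    def whitening_metric(x):
        """mean(eig^2) / mean(eig)^2 of the feature covariance: 1.0 when the
        covariance is a multiple of the identity, larger as it departs from it."""
        x = x.reshape(-1, x.shape[-1]).float()   # (frames, channels)
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.T @ x) / x.shape[0]
        dim = cov.shape[0]
        return (torch.trace(cov @ cov) * dim / torch.trace(cov) ** 2).item()

    print(whitening_metric(torch.randn(4000, 256)))  # close to 1.0 for white noise
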
], batch size: 371, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:25:32,402 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.09 vs. limit=15.0 2023-06-27 11:25:38,570 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-27 11:25:41,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1795812.0, ans=0.125 2023-06-27 11:26:21,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1795992.0, ans=6.0 2023-06-27 11:26:59,062 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.781e+02 6.964e+02 9.660e+02 1.513e+03 3.423e+03, threshold=1.932e+03, percent-clipped=14.0 2023-06-27 11:27:00,605 INFO [train.py:996] (2/4) Epoch 10, batch 24900, loss[loss=0.1948, simple_loss=0.2488, pruned_loss=0.07043, over 20182.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.279, pruned_loss=0.06701, over 4284812.85 frames. ], batch size: 703, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:28:24,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=1796352.0, ans=0.02 2023-06-27 11:28:35,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1796352.0, ans=0.125 2023-06-27 11:28:41,467 INFO [train.py:996] (2/4) Epoch 10, batch 24950, loss[loss=0.2566, simple_loss=0.3205, pruned_loss=0.0963, over 21390.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2876, pruned_loss=0.07048, over 4282883.44 frames. ], batch size: 549, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:29:08,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1796472.0, ans=0.125 2023-06-27 11:29:16,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1796472.0, ans=0.1 2023-06-27 11:29:16,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=1796472.0, ans=0.02 2023-06-27 11:29:29,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1796532.0, ans=0.2 2023-06-27 11:30:18,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1796652.0, ans=0.0 2023-06-27 11:30:19,511 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.654e+02 6.926e+02 9.542e+02 1.348e+03 3.788e+03, threshold=1.908e+03, percent-clipped=7.0 2023-06-27 11:30:20,992 INFO [train.py:996] (2/4) Epoch 10, batch 25000, loss[loss=0.1892, simple_loss=0.2629, pruned_loss=0.05774, over 21649.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.293, pruned_loss=0.07218, over 4276635.76 frames. ], batch size: 332, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:30:38,328 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.11 vs. 
limit=15.0 2023-06-27 11:30:39,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1796712.0, ans=0.125 2023-06-27 11:30:52,986 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=12.0 2023-06-27 11:30:55,068 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.15 vs. limit=8.0 2023-06-27 11:31:03,040 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=15.0 2023-06-27 11:31:40,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1796892.0, ans=0.0 2023-06-27 11:31:42,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1796892.0, ans=0.125 2023-06-27 11:31:51,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1796952.0, ans=0.0 2023-06-27 11:32:07,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1796952.0, ans=0.125 2023-06-27 11:32:11,881 INFO [train.py:996] (2/4) Epoch 10, batch 25050, loss[loss=0.1999, simple_loss=0.2662, pruned_loss=0.06679, over 21590.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2876, pruned_loss=0.07131, over 4270691.13 frames. ], batch size: 415, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:32:37,156 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=6.0 2023-06-27 11:32:49,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1797132.0, ans=0.0 2023-06-27 11:33:20,606 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.01 vs. limit=8.0 2023-06-27 11:33:50,162 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.906e+02 5.494e+02 7.889e+02 1.087e+03 2.340e+03, threshold=1.578e+03, percent-clipped=4.0 2023-06-27 11:33:51,535 INFO [train.py:996] (2/4) Epoch 10, batch 25100, loss[loss=0.1894, simple_loss=0.2545, pruned_loss=0.0621, over 21526.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2818, pruned_loss=0.06951, over 4278462.29 frames. 
], batch size: 391, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:34:03,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1797312.0, ans=0.0 2023-06-27 11:34:04,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1797312.0, ans=0.0 2023-06-27 11:34:22,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1797432.0, ans=0.1 2023-06-27 11:35:14,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1797552.0, ans=0.1 2023-06-27 11:35:18,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1797552.0, ans=0.1 2023-06-27 11:35:18,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1797552.0, ans=0.2 2023-06-27 11:35:26,555 INFO [train.py:996] (2/4) Epoch 10, batch 25150, loss[loss=0.197, simple_loss=0.2872, pruned_loss=0.05344, over 21405.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2848, pruned_loss=0.06761, over 4262992.54 frames. ], batch size: 211, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:35:52,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1797672.0, ans=0.125 2023-06-27 11:36:59,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1797852.0, ans=0.0 2023-06-27 11:37:04,859 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.938e+02 6.755e+02 1.265e+03 1.654e+03 3.292e+03, threshold=2.530e+03, percent-clipped=31.0 2023-06-27 11:37:06,429 INFO [train.py:996] (2/4) Epoch 10, batch 25200, loss[loss=0.195, simple_loss=0.2512, pruned_loss=0.06945, over 20247.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2861, pruned_loss=0.06649, over 4265859.40 frames. ], batch size: 703, lr: 2.89e-03, grad_scale: 32.0 2023-06-27 11:38:21,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1798092.0, ans=0.1 2023-06-27 11:38:46,163 INFO [train.py:996] (2/4) Epoch 10, batch 25250, loss[loss=0.1851, simple_loss=0.2513, pruned_loss=0.05949, over 21230.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2839, pruned_loss=0.06513, over 4267204.08 frames. ], batch size: 176, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:39:55,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1798392.0, ans=0.125 2023-06-27 11:40:13,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1798452.0, ans=0.04949747468305833 2023-06-27 11:40:19,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1798452.0, ans=0.0 2023-06-27 11:40:20,453 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.56 vs. 
limit=15.0 2023-06-27 11:40:32,592 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.222e+02 7.212e+02 1.026e+03 1.530e+03 2.488e+03, threshold=2.053e+03, percent-clipped=0.0 2023-06-27 11:40:32,623 INFO [train.py:996] (2/4) Epoch 10, batch 25300, loss[loss=0.2032, simple_loss=0.2826, pruned_loss=0.06191, over 21782.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.282, pruned_loss=0.06457, over 4270111.02 frames. ], batch size: 282, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:40:57,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1798572.0, ans=0.125 2023-06-27 11:42:13,849 INFO [train.py:996] (2/4) Epoch 10, batch 25350, loss[loss=0.2055, simple_loss=0.2853, pruned_loss=0.06288, over 21819.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2835, pruned_loss=0.06415, over 4265956.73 frames. ], batch size: 107, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:42:19,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1798812.0, ans=0.125 2023-06-27 11:42:19,942 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-27 11:42:32,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1798872.0, ans=0.125 2023-06-27 11:42:54,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1798932.0, ans=0.015 2023-06-27 11:43:09,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1798932.0, ans=0.125 2023-06-27 11:43:24,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1798992.0, ans=0.125 2023-06-27 11:43:35,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1799052.0, ans=0.0 2023-06-27 11:43:53,123 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.945e+02 6.520e+02 8.858e+02 1.308e+03 2.699e+03, threshold=1.772e+03, percent-clipped=4.0 2023-06-27 11:43:53,154 INFO [train.py:996] (2/4) Epoch 10, batch 25400, loss[loss=0.1851, simple_loss=0.2792, pruned_loss=0.04555, over 21623.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2801, pruned_loss=0.06329, over 4245270.49 frames. 
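Most of the [scaling.py:182] entries print the current value (ans=...) of a named ScheduledFloat hyperparameter, such as a dropout probability or a skip rate, evaluated at the current batch_count. A plausible reading is a piecewise-linear schedule over batch count; the sketch below implements that shape. The class name and breakpoints are made up for illustration and are not the library's implementation.

```python
class PiecewiseLinearSchedule:
    """A float hyperparameter given as a piecewise-linear function of batch_count."""

    def __init__(self, *points):
        # points: (batch_count, value) pairs; kept sorted by batch_count.
        self.points = sorted(points)

    def __call__(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                frac = (batch_count - x0) / (x1 - x0)
                return y0 + frac * (y1 - y0)

# Example with invented breakpoints: a skip rate that decays as training proceeds.
conv_skip_rate = PiecewiseLinearSchedule((0, 0.5), (20000, 0.05), (50000, 0.0))
print(conv_skip_rate(1_796_712.0))  # 0.0 once past the last breakpoint
```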
], batch size: 263, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:44:12,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1799172.0, ans=0.95 2023-06-27 11:44:15,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1799172.0, ans=0.125 2023-06-27 11:44:46,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1799232.0, ans=0.0 2023-06-27 11:44:58,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1799292.0, ans=0.0 2023-06-27 11:45:00,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1799292.0, ans=0.125 2023-06-27 11:45:02,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1799292.0, ans=0.125 2023-06-27 11:45:15,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1799352.0, ans=0.0 2023-06-27 11:45:34,082 INFO [train.py:996] (2/4) Epoch 10, batch 25450, loss[loss=0.1839, simple_loss=0.268, pruned_loss=0.04988, over 21496.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2794, pruned_loss=0.06399, over 4254104.72 frames. ], batch size: 131, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:45:39,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1799412.0, ans=0.125 2023-06-27 11:45:41,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1799412.0, ans=0.09899494936611666 2023-06-27 11:46:11,167 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 11:46:48,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1799592.0, ans=0.2 2023-06-27 11:47:16,317 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.168e+02 6.039e+02 8.121e+02 1.135e+03 2.521e+03, threshold=1.624e+03, percent-clipped=2.0 2023-06-27 11:47:16,349 INFO [train.py:996] (2/4) Epoch 10, batch 25500, loss[loss=0.187, simple_loss=0.2727, pruned_loss=0.05063, over 21293.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.2804, pruned_loss=0.06206, over 4254990.20 frames. ], batch size: 176, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:48:17,275 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.77 vs. limit=22.5 2023-06-27 11:48:45,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1799952.0, ans=0.0 2023-06-27 11:48:58,448 INFO [train.py:996] (2/4) Epoch 10, batch 25550, loss[loss=0.2404, simple_loss=0.3406, pruned_loss=0.07007, over 21708.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2866, pruned_loss=0.06186, over 4243612.75 frames. 
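The [scaling.py:962] entries compare a per-module Whitening metric against a limit for a block of num_channels activations. One way to quantify how far a set of channels is from a white (uncorrelated, equal-variance) covariance is the ratio mean(eig^2) / mean(eig)^2 of the channel covariance eigenvalues, which equals 1 for a perfectly flat spectrum and grows as it becomes uneven. The function below computes that proxy; it is an assumed interpretation of the logged numbers, not necessarily the exact formula behind them.

```python
import torch

def whitening_metric(x: torch.Tensor) -> float:
    """mean(eig^2) / mean(eig)^2 of the channel covariance of x (num_frames, num_channels)."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]          # (C, C) channel covariance
    eigs = torch.linalg.eigvalsh(cov)     # real eigenvalues, ascending
    return float((eigs ** 2).mean() / eigs.mean() ** 2)

# Toy check: mixing channels together raises the metric well above 1.0.
torch.manual_seed(0)
white = torch.randn(1000, 256)
mixed = white + white @ torch.randn(256, 256) * 0.1
limit = 15.0
for name, feats in [("white", white), ("mixed", mixed)]:
    print(f"{name}: metric={whitening_metric(feats):.2f} vs. limit={limit}")
```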
], batch size: 441, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:49:16,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1800012.0, ans=0.1 2023-06-27 11:50:20,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1800192.0, ans=0.125 2023-06-27 11:50:30,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1800252.0, ans=0.125 2023-06-27 11:50:31,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1800252.0, ans=0.125 2023-06-27 11:50:32,243 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.01 vs. limit=15.0 2023-06-27 11:50:33,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1800252.0, ans=0.1 2023-06-27 11:50:39,238 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.324e+02 5.995e+02 1.017e+03 1.623e+03 5.096e+03, threshold=2.035e+03, percent-clipped=24.0 2023-06-27 11:50:39,270 INFO [train.py:996] (2/4) Epoch 10, batch 25600, loss[loss=0.253, simple_loss=0.3241, pruned_loss=0.09092, over 21604.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2905, pruned_loss=0.06255, over 4232444.12 frames. ], batch size: 414, lr: 2.88e-03, grad_scale: 32.0 2023-06-27 11:51:08,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1800372.0, ans=0.2 2023-06-27 11:51:10,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1800372.0, ans=0.125 2023-06-27 11:51:12,179 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 11:51:34,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1800432.0, ans=0.125 2023-06-27 11:52:19,296 INFO [train.py:996] (2/4) Epoch 10, batch 25650, loss[loss=0.1946, simple_loss=0.2595, pruned_loss=0.06481, over 21335.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2906, pruned_loss=0.06449, over 4238729.02 frames. ], batch size: 211, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:52:20,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1800612.0, ans=0.0 2023-06-27 11:52:34,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1800612.0, ans=0.125 2023-06-27 11:52:39,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1800672.0, ans=0.1 2023-06-27 11:54:00,587 INFO [train.py:996] (2/4) Epoch 10, batch 25700, loss[loss=0.1959, simple_loss=0.2802, pruned_loss=0.05581, over 21784.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2881, pruned_loss=0.06555, over 4239881.55 frames. 
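The grad_scale value printed with each batch (mostly 16.0 in this stretch, rising to 32.0 at batches such as 25600) behaves like dynamic loss scaling for mixed-precision training: the scale is cut on overflow and doubled after a run of stable steps. Below is a minimal loop using PyTorch's standard torch.cuda.amp.GradScaler to show that mechanism; the model, data, and the init_scale/growth_interval settings are placeholders rather than the recipe's actual configuration, and it needs a CUDA device to run.

```python
import torch

model = torch.nn.Linear(80, 500).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(init_scale=16.0, growth_interval=2000)

for step in range(10):
    feats = torch.randn(8, 80, device="cuda")
    targets = torch.randint(0, 500, (8,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.cross_entropy(model(feats), targets)
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # unscales grads; skips the step on inf/nan
    scaler.update()                # shrinks the scale on overflow, doubles it
                                   # after `growth_interval` clean steps
    print(f"grad_scale: {scaler.get_scale()}")
```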
], batch size: 282, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:54:06,859 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.076e+02 8.398e+02 1.386e+03 2.056e+03 4.305e+03, threshold=2.773e+03, percent-clipped=25.0 2023-06-27 11:54:18,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1800912.0, ans=0.125 2023-06-27 11:54:38,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1800972.0, ans=0.125 2023-06-27 11:54:51,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1801032.0, ans=0.0 2023-06-27 11:54:52,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1801032.0, ans=0.2 2023-06-27 11:55:09,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1801092.0, ans=0.125 2023-06-27 11:55:40,896 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 11:55:43,005 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=15.0 2023-06-27 11:55:46,719 INFO [train.py:996] (2/4) Epoch 10, batch 25750, loss[loss=0.2693, simple_loss=0.3662, pruned_loss=0.08616, over 21909.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2931, pruned_loss=0.06839, over 4243457.86 frames. ], batch size: 316, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:55:52,816 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.99 vs. limit=6.0 2023-06-27 11:56:58,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1801392.0, ans=0.125 2023-06-27 11:57:18,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1801452.0, ans=0.125 2023-06-27 11:57:26,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1801452.0, ans=0.0 2023-06-27 11:57:39,572 INFO [train.py:996] (2/4) Epoch 10, batch 25800, loss[loss=0.2301, simple_loss=0.3131, pruned_loss=0.07355, over 21432.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3041, pruned_loss=0.07319, over 4251408.68 frames. ], batch size: 211, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:57:41,427 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.492e+02 7.009e+02 1.091e+03 1.518e+03 3.688e+03, threshold=2.182e+03, percent-clipped=4.0 2023-06-27 11:57:51,989 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 11:58:21,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1801632.0, ans=0.1 2023-06-27 11:59:00,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1801692.0, ans=0.1 2023-06-27 11:59:23,893 INFO [train.py:996] (2/4) Epoch 10, batch 25850, loss[loss=0.2279, simple_loss=0.2938, pruned_loss=0.08099, over 21780.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3053, pruned_loss=0.07228, over 4260867.90 frames. 
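Each [train.py:996] entry pairs the current batch's loss with a tot_loss[... over N frames] whose frame count keeps growing, which looks like a frame-weighted running aggregate over recent batches. The snippet below maintains such an aggregate with a simple exponential decay, fed with two of the batch losses reported above; the decay factor and class name are assumptions for illustration, not the training script's actual bookkeeping.

```python
class RunningLoss:
    """Frame-weighted, exponentially decayed loss aggregate (illustrative)."""

    def __init__(self, decay: float = 0.999):
        self.decay = decay
        self.loss_sum = 0.0   # decayed sum of loss * frames
        self.frames = 0.0     # decayed sum of frames

    def update(self, batch_loss: float, batch_frames: float) -> None:
        self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)

tracker = RunningLoss()
for batch_loss, batch_frames in [(0.2693, 21909.0), (0.2301, 21432.0)]:
    tracker.update(batch_loss, batch_frames)
    print(f"tot_loss[loss={tracker.value:.4f}, over {tracker.frames:.2f} frames.]")
```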
], batch size: 441, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:00:05,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1801932.0, ans=0.0 2023-06-27 12:00:32,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1801992.0, ans=0.125 2023-06-27 12:01:07,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1802052.0, ans=0.125 2023-06-27 12:01:11,685 INFO [train.py:996] (2/4) Epoch 10, batch 25900, loss[loss=0.2585, simple_loss=0.315, pruned_loss=0.101, over 21637.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.309, pruned_loss=0.07404, over 4273172.23 frames. ], batch size: 471, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:01:12,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1802112.0, ans=0.125 2023-06-27 12:01:13,359 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.315e+02 6.367e+02 8.577e+02 1.335e+03 4.211e+03, threshold=1.715e+03, percent-clipped=7.0 2023-06-27 12:01:32,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1802172.0, ans=0.0 2023-06-27 12:01:52,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1802232.0, ans=0.025 2023-06-27 12:02:53,658 INFO [train.py:996] (2/4) Epoch 10, batch 25950, loss[loss=0.2604, simple_loss=0.3283, pruned_loss=0.09628, over 21582.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3148, pruned_loss=0.07678, over 4274465.87 frames. ], batch size: 389, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:03:07,712 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.93 vs. limit=15.0 2023-06-27 12:03:28,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=1802472.0, ans=0.1 2023-06-27 12:03:40,335 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.26 vs. limit=12.0 2023-06-27 12:03:59,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1802592.0, ans=0.0 2023-06-27 12:04:16,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1802592.0, ans=0.125 2023-06-27 12:04:29,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1802652.0, ans=0.125 2023-06-27 12:04:35,409 INFO [train.py:996] (2/4) Epoch 10, batch 26000, loss[loss=0.2411, simple_loss=0.3274, pruned_loss=0.07739, over 21709.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3156, pruned_loss=0.07556, over 4269033.75 frames. ], batch size: 351, lr: 2.88e-03, grad_scale: 32.0 2023-06-27 12:04:37,160 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.008e+02 6.185e+02 7.875e+02 1.125e+03 3.104e+03, threshold=1.575e+03, percent-clipped=8.0 2023-06-27 12:05:07,616 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. 
limit=6.0 2023-06-27 12:06:02,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1802952.0, ans=0.1 2023-06-27 12:06:10,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1802952.0, ans=0.2 2023-06-27 12:06:16,105 INFO [train.py:996] (2/4) Epoch 10, batch 26050, loss[loss=0.2159, simple_loss=0.2853, pruned_loss=0.07321, over 21853.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3148, pruned_loss=0.07554, over 4268106.81 frames. ], batch size: 298, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:06:21,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1803012.0, ans=0.125 2023-06-27 12:06:42,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1803072.0, ans=0.125 2023-06-27 12:06:47,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1803072.0, ans=0.015 2023-06-27 12:07:12,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1803192.0, ans=0.125 2023-06-27 12:07:44,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1803252.0, ans=0.0 2023-06-27 12:07:50,610 INFO [train.py:996] (2/4) Epoch 10, batch 26100, loss[loss=0.2334, simple_loss=0.3133, pruned_loss=0.07672, over 21773.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3078, pruned_loss=0.07421, over 4277174.46 frames. ], batch size: 112, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:07:53,723 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.444e+02 6.064e+02 8.418e+02 1.151e+03 2.910e+03, threshold=1.684e+03, percent-clipped=10.0 2023-06-27 12:08:29,851 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-27 12:09:09,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1803492.0, ans=0.125 2023-06-27 12:09:13,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1803552.0, ans=0.1 2023-06-27 12:09:30,847 INFO [train.py:996] (2/4) Epoch 10, batch 26150, loss[loss=0.2006, simple_loss=0.2755, pruned_loss=0.06282, over 21801.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3043, pruned_loss=0.07441, over 4282000.46 frames. ], batch size: 247, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:10:10,761 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.94 vs. limit=6.0 2023-06-27 12:10:13,736 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 12:11:15,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1803912.0, ans=0.125 2023-06-27 12:11:16,870 INFO [train.py:996] (2/4) Epoch 10, batch 26200, loss[loss=0.2454, simple_loss=0.3454, pruned_loss=0.07271, over 21632.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3048, pruned_loss=0.07247, over 4284329.86 frames. 
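Parameter names such as balancer1.prob, balancer.min_positive, balancer.max_positive and balancer_na.min_abs suggest per-module constraints on activation statistics, e.g. the per-channel fraction of positive values or the mean absolute value. The sketch below only measures those statistics and flags channels outside given bounds; the real modules presumably also push activations back inside the bounds (e.g. through the gradients), which is not reproduced here, and the bound values are examples only.

```python
import torch

def check_activation_stats(x: torch.Tensor,
                           min_positive: float = 0.05,
                           max_positive: float = 0.95,
                           min_abs: float = 0.02) -> dict:
    """Count channels of x (num_frames, num_channels) violating each bound."""
    frac_positive = (x > 0).float().mean(dim=0)  # per-channel fraction of positives
    mean_abs = x.abs().mean(dim=0)               # per-channel mean |x|
    return {
        "too_few_positive": int((frac_positive < min_positive).sum()),
        "too_many_positive": int((frac_positive > max_positive).sum()),
        "too_small_abs": int((mean_abs < min_abs).sum()),
    }

x = torch.randn(1000, 256) * 0.5 + 0.1
print(check_activation_stats(x))
```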
], batch size: 389, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:11:20,491 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.988e+02 7.097e+02 1.092e+03 1.637e+03 2.606e+03, threshold=2.184e+03, percent-clipped=21.0 2023-06-27 12:11:56,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1804032.0, ans=0.125 2023-06-27 12:12:44,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1804152.0, ans=0.125 2023-06-27 12:12:56,973 INFO [train.py:996] (2/4) Epoch 10, batch 26250, loss[loss=0.2424, simple_loss=0.3222, pruned_loss=0.08133, over 21916.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3084, pruned_loss=0.07181, over 4278226.03 frames. ], batch size: 124, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:13:26,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1804272.0, ans=0.125 2023-06-27 12:13:29,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1804272.0, ans=0.2 2023-06-27 12:13:42,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1804332.0, ans=0.2 2023-06-27 12:13:50,875 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.87 vs. limit=15.0 2023-06-27 12:14:13,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1804452.0, ans=0.125 2023-06-27 12:14:36,326 INFO [train.py:996] (2/4) Epoch 10, batch 26300, loss[loss=0.2276, simple_loss=0.302, pruned_loss=0.07657, over 21404.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3058, pruned_loss=0.07231, over 4282837.65 frames. ], batch size: 144, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:14:39,661 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.217e+02 5.994e+02 7.746e+02 1.132e+03 2.553e+03, threshold=1.549e+03, percent-clipped=2.0 2023-06-27 12:15:06,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1804572.0, ans=0.125 2023-06-27 12:15:37,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1804692.0, ans=0.0 2023-06-27 12:15:56,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1804752.0, ans=0.0 2023-06-27 12:16:16,767 INFO [train.py:996] (2/4) Epoch 10, batch 26350, loss[loss=0.2656, simple_loss=0.335, pruned_loss=0.09806, over 21785.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3039, pruned_loss=0.07287, over 4285168.71 frames. ], batch size: 441, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:16:35,806 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.89 vs. limit=10.0 2023-06-27 12:16:38,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1804872.0, ans=0.125 2023-06-27 12:17:51,991 INFO [train.py:996] (2/4) Epoch 10, batch 26400, loss[loss=0.181, simple_loss=0.2498, pruned_loss=0.05607, over 21762.00 frames. 
], tot_loss[loss=0.2232, simple_loss=0.2997, pruned_loss=0.07341, over 4270557.50 frames. ], batch size: 351, lr: 2.88e-03, grad_scale: 32.0 2023-06-27 12:17:52,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1805112.0, ans=0.0 2023-06-27 12:17:55,528 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.612e+02 7.254e+02 1.118e+03 1.690e+03 3.507e+03, threshold=2.236e+03, percent-clipped=29.0 2023-06-27 12:18:44,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1805232.0, ans=0.09899494936611666 2023-06-27 12:19:21,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1805352.0, ans=0.025 2023-06-27 12:19:36,213 INFO [train.py:996] (2/4) Epoch 10, batch 26450, loss[loss=0.2352, simple_loss=0.3641, pruned_loss=0.05315, over 20794.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2998, pruned_loss=0.07307, over 4269663.81 frames. ], batch size: 607, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:19:48,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1805412.0, ans=0.125 2023-06-27 12:20:10,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1805472.0, ans=0.1 2023-06-27 12:20:23,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1805532.0, ans=0.1 2023-06-27 12:20:32,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1805532.0, ans=0.015 2023-06-27 12:20:41,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1805592.0, ans=0.125 2023-06-27 12:21:09,486 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 12:21:13,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1805652.0, ans=0.125 2023-06-27 12:21:14,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1805652.0, ans=0.1 2023-06-27 12:21:19,540 INFO [train.py:996] (2/4) Epoch 10, batch 26500, loss[loss=0.2072, simple_loss=0.2809, pruned_loss=0.06678, over 21764.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3, pruned_loss=0.07073, over 4264994.74 frames. ], batch size: 247, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:21:28,834 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.787e+02 8.473e+02 1.317e+03 2.228e+03 4.940e+03, threshold=2.635e+03, percent-clipped=24.0 2023-06-27 12:21:37,621 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=15.0 2023-06-27 12:21:39,095 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.84 vs. 
limit=15.0 2023-06-27 12:22:30,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1805892.0, ans=0.125 2023-06-27 12:23:07,847 INFO [train.py:996] (2/4) Epoch 10, batch 26550, loss[loss=0.2017, simple_loss=0.3033, pruned_loss=0.05002, over 21640.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2979, pruned_loss=0.06912, over 4262704.94 frames. ], batch size: 414, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:23:47,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1806072.0, ans=0.0 2023-06-27 12:23:57,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1806132.0, ans=0.0 2023-06-27 12:24:10,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1806192.0, ans=0.05 2023-06-27 12:24:27,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1806192.0, ans=0.04949747468305833 2023-06-27 12:24:33,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1806252.0, ans=10.0 2023-06-27 12:24:34,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1806252.0, ans=0.125 2023-06-27 12:24:36,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1806252.0, ans=0.125 2023-06-27 12:24:38,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1806252.0, ans=0.125 2023-06-27 12:24:53,640 INFO [train.py:996] (2/4) Epoch 10, batch 26600, loss[loss=0.1933, simple_loss=0.2709, pruned_loss=0.05782, over 21605.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2963, pruned_loss=0.06667, over 4256017.93 frames. ], batch size: 298, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:25:02,996 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.900e+02 9.676e+02 1.340e+03 1.727e+03 3.782e+03, threshold=2.679e+03, percent-clipped=7.0 2023-06-27 12:25:05,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1806312.0, ans=10.0 2023-06-27 12:25:31,802 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=22.5 2023-06-27 12:26:12,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1806552.0, ans=0.0 2023-06-27 12:26:36,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1806552.0, ans=0.125 2023-06-27 12:26:38,851 INFO [train.py:996] (2/4) Epoch 10, batch 26650, loss[loss=0.1909, simple_loss=0.2706, pruned_loss=0.0556, over 21640.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2895, pruned_loss=0.06519, over 4263110.25 frames. 
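Scheduled values named attention_skip_rate, conv_skip_rate, ff2_skip_rate and bypass.skip_rate point to stochastic skipping of sub-modules during training, a layer-drop style regularizer whose rate is annealed with batch_count. Below is a small generic wrapper that skips its inner module with a given probability while training; the wrapper name and the fixed rate are assumptions for illustration (the logged rates are scheduled, not constant).

```python
import torch
import torch.nn as nn

class StochasticSkip(nn.Module):
    """Applies `module` residually, but skips it with probability `skip_rate` in training."""

    def __init__(self, module: nn.Module, skip_rate: float = 0.1):
        super().__init__()
        self.module = module
        self.skip_rate = skip_rate  # could be a batch-count schedule instead

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and torch.rand(()) < self.skip_rate:
            return x                  # bypass: identity, sub-module not evaluated
        return x + self.module(x)     # residual application of the sub-module

ff = StochasticSkip(
    nn.Sequential(nn.Linear(256, 768), nn.ReLU(), nn.Linear(768, 256)),
    skip_rate=0.05,
)
print(ff(torch.randn(10, 256)).shape)
```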
], batch size: 415, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:26:59,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1806672.0, ans=0.125 2023-06-27 12:27:17,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1806732.0, ans=0.125 2023-06-27 12:27:19,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1806732.0, ans=0.125 2023-06-27 12:27:41,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1806792.0, ans=0.125 2023-06-27 12:28:09,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1806852.0, ans=0.0 2023-06-27 12:28:18,264 INFO [train.py:996] (2/4) Epoch 10, batch 26700, loss[loss=0.1899, simple_loss=0.2621, pruned_loss=0.05883, over 21538.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2831, pruned_loss=0.06222, over 4267775.72 frames. ], batch size: 212, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:28:23,315 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.270e+02 4.740e+02 5.974e+02 7.751e+02 2.095e+03, threshold=1.195e+03, percent-clipped=0.0 2023-06-27 12:28:42,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1806972.0, ans=0.0 2023-06-27 12:28:52,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1806972.0, ans=0.1 2023-06-27 12:28:52,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1806972.0, ans=0.125 2023-06-27 12:29:06,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1807032.0, ans=0.0 2023-06-27 12:30:03,647 INFO [train.py:996] (2/4) Epoch 10, batch 26750, loss[loss=0.2261, simple_loss=0.309, pruned_loss=0.07156, over 21313.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2828, pruned_loss=0.06117, over 4276933.41 frames. ], batch size: 159, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:31:45,852 INFO [train.py:996] (2/4) Epoch 10, batch 26800, loss[loss=0.1956, simple_loss=0.2664, pruned_loss=0.06235, over 21923.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2898, pruned_loss=0.06447, over 4270823.84 frames. ], batch size: 98, lr: 2.88e-03, grad_scale: 32.0 2023-06-27 12:31:51,192 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.485e+02 8.253e+02 1.353e+03 2.004e+03 3.922e+03, threshold=2.706e+03, percent-clipped=54.0 2023-06-27 12:32:21,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1807572.0, ans=0.0 2023-06-27 12:33:02,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1807692.0, ans=0.125 2023-06-27 12:33:19,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1807752.0, ans=0.2 2023-06-27 12:33:27,234 INFO [train.py:996] (2/4) Epoch 10, batch 26850, loss[loss=0.2067, simple_loss=0.275, pruned_loss=0.06923, over 21389.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2901, pruned_loss=0.06669, over 4270363.22 frames. 
], batch size: 131, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:33:39,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1807812.0, ans=0.125 2023-06-27 12:33:42,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1807872.0, ans=0.0 2023-06-27 12:33:42,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1807872.0, ans=0.1 2023-06-27 12:33:55,362 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=12.0 2023-06-27 12:33:58,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1807872.0, ans=0.0 2023-06-27 12:34:39,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1807992.0, ans=0.1 2023-06-27 12:35:07,060 INFO [train.py:996] (2/4) Epoch 10, batch 26900, loss[loss=0.1852, simple_loss=0.2473, pruned_loss=0.06154, over 21661.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2823, pruned_loss=0.06631, over 4269377.66 frames. ], batch size: 282, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:35:13,695 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.943e+02 6.437e+02 8.362e+02 1.264e+03 2.899e+03, threshold=1.672e+03, percent-clipped=1.0 2023-06-27 12:35:25,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1808172.0, ans=0.1 2023-06-27 12:36:46,478 INFO [train.py:996] (2/4) Epoch 10, batch 26950, loss[loss=0.2351, simple_loss=0.3231, pruned_loss=0.07354, over 21428.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2816, pruned_loss=0.06588, over 4272762.28 frames. ], batch size: 194, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:37:09,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1808472.0, ans=0.125 2023-06-27 12:37:16,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1808472.0, ans=0.1 2023-06-27 12:38:27,739 INFO [train.py:996] (2/4) Epoch 10, batch 27000, loss[loss=0.1841, simple_loss=0.2666, pruned_loss=0.05079, over 21353.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2834, pruned_loss=0.06483, over 4269343.54 frames. ], batch size: 194, lr: 2.88e-03, grad_scale: 8.0 2023-06-27 12:38:27,740 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-27 12:38:47,566 INFO [train.py:1028] (2/4) Epoch 10, validation: loss=0.2474, simple_loss=0.3368, pruned_loss=0.07904, over 1796401.00 frames. 2023-06-27 12:38:47,567 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-27 12:38:50,922 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.70 vs. 
limit=10.0 2023-06-27 12:39:01,424 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.008e+02 5.840e+02 8.267e+02 1.216e+03 2.372e+03, threshold=1.653e+03, percent-clipped=7.0 2023-06-27 12:39:38,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1808832.0, ans=0.0 2023-06-27 12:40:08,556 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.75 vs. limit=10.0 2023-06-27 12:40:13,823 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.81 vs. limit=15.0 2023-06-27 12:40:29,866 INFO [train.py:996] (2/4) Epoch 10, batch 27050, loss[loss=0.1931, simple_loss=0.2823, pruned_loss=0.05192, over 21095.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2864, pruned_loss=0.06211, over 4273402.77 frames. ], batch size: 607, lr: 2.88e-03, grad_scale: 8.0 2023-06-27 12:41:05,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1809072.0, ans=0.0 2023-06-27 12:41:07,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1809072.0, ans=0.2 2023-06-27 12:42:10,022 INFO [train.py:996] (2/4) Epoch 10, batch 27100, loss[loss=0.1824, simple_loss=0.2587, pruned_loss=0.05309, over 21796.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2878, pruned_loss=0.06287, over 4279583.82 frames. ], batch size: 102, lr: 2.88e-03, grad_scale: 8.0 2023-06-27 12:42:22,704 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.015e+02 5.614e+02 8.365e+02 1.169e+03 2.643e+03, threshold=1.673e+03, percent-clipped=10.0 2023-06-27 12:42:41,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1809372.0, ans=0.2 2023-06-27 12:43:24,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1809492.0, ans=0.125 2023-06-27 12:43:32,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1809552.0, ans=0.125 2023-06-27 12:43:51,703 INFO [train.py:996] (2/4) Epoch 10, batch 27150, loss[loss=0.262, simple_loss=0.3547, pruned_loss=0.08464, over 21757.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.3, pruned_loss=0.0663, over 4275065.69 frames. ], batch size: 332, lr: 2.88e-03, grad_scale: 8.0 2023-06-27 12:44:10,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1809612.0, ans=0.0 2023-06-27 12:44:35,573 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.50 vs. limit=15.0 2023-06-27 12:44:38,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1809732.0, ans=0.125 2023-06-27 12:44:39,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1809732.0, ans=0.125 2023-06-27 12:44:55,512 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.53 vs. 
limit=10.0 2023-06-27 12:45:31,917 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 12:45:37,928 INFO [train.py:996] (2/4) Epoch 10, batch 27200, loss[loss=0.2839, simple_loss=0.3533, pruned_loss=0.1073, over 21444.00 frames. ], tot_loss[loss=0.223, simple_loss=0.308, pruned_loss=0.06899, over 4273166.33 frames. ], batch size: 471, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:45:50,768 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.524e+02 6.689e+02 1.006e+03 1.593e+03 2.972e+03, threshold=2.013e+03, percent-clipped=22.0 2023-06-27 12:45:56,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1809912.0, ans=0.125 2023-06-27 12:46:06,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1809972.0, ans=0.2 2023-06-27 12:46:11,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1809972.0, ans=0.05 2023-06-27 12:46:38,461 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-27 12:47:19,001 INFO [train.py:996] (2/4) Epoch 10, batch 27250, loss[loss=0.2286, simple_loss=0.3091, pruned_loss=0.07406, over 21627.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3103, pruned_loss=0.07252, over 4277855.35 frames. ], batch size: 263, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:47:21,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1810212.0, ans=0.0 2023-06-27 12:47:24,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1810212.0, ans=0.0 2023-06-27 12:47:40,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1810272.0, ans=0.125 2023-06-27 12:48:51,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1810452.0, ans=0.125 2023-06-27 12:48:53,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1810452.0, ans=0.125 2023-06-27 12:48:58,237 INFO [train.py:996] (2/4) Epoch 10, batch 27300, loss[loss=0.1829, simple_loss=0.255, pruned_loss=0.05545, over 16629.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3117, pruned_loss=0.07351, over 4275447.07 frames. ], batch size: 60, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:49:06,654 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.182e+02 6.508e+02 9.291e+02 1.314e+03 3.410e+03, threshold=1.858e+03, percent-clipped=10.0 2023-06-27 12:49:07,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1810512.0, ans=0.0 2023-06-27 12:49:18,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1810572.0, ans=0.07 2023-06-27 12:50:11,908 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.06 vs. 
limit=15.0 2023-06-27 12:50:22,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1810752.0, ans=0.125 2023-06-27 12:50:34,410 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.12 vs. limit=12.0 2023-06-27 12:50:38,159 INFO [train.py:996] (2/4) Epoch 10, batch 27350, loss[loss=0.2404, simple_loss=0.3272, pruned_loss=0.07679, over 21524.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3147, pruned_loss=0.07379, over 4275035.23 frames. ], batch size: 471, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:50:51,563 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 12:51:35,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1810992.0, ans=15.0 2023-06-27 12:51:42,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1810992.0, ans=0.0 2023-06-27 12:52:12,673 INFO [train.py:996] (2/4) Epoch 10, batch 27400, loss[loss=0.2161, simple_loss=0.2858, pruned_loss=0.07324, over 21812.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3104, pruned_loss=0.07369, over 4281240.81 frames. ], batch size: 118, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:52:20,978 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.204e+02 5.726e+02 8.020e+02 1.365e+03 2.836e+03, threshold=1.604e+03, percent-clipped=8.0 2023-06-27 12:52:51,156 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 12:53:07,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1811232.0, ans=0.125 2023-06-27 12:53:15,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1811232.0, ans=0.0 2023-06-27 12:53:22,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1811292.0, ans=0.125 2023-06-27 12:53:30,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1811292.0, ans=0.125 2023-06-27 12:53:51,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1811352.0, ans=0.1 2023-06-27 12:53:54,327 INFO [train.py:996] (2/4) Epoch 10, batch 27450, loss[loss=0.1944, simple_loss=0.2832, pruned_loss=0.05278, over 21410.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3042, pruned_loss=0.07224, over 4280243.20 frames. ], batch size: 194, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:54:22,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1811472.0, ans=0.0 2023-06-27 12:54:35,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1811532.0, ans=0.0 2023-06-27 12:54:49,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1811532.0, ans=0.0 2023-06-27 12:55:30,303 INFO [train.py:996] (2/4) Epoch 10, batch 27500, loss[loss=0.1901, simple_loss=0.2721, pruned_loss=0.054, over 21662.00 frames. 
], tot_loss[loss=0.2232, simple_loss=0.3018, pruned_loss=0.07231, over 4284327.66 frames. ], batch size: 230, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:55:32,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1811712.0, ans=0.1 2023-06-27 12:55:38,230 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.956e+02 6.120e+02 9.251e+02 1.541e+03 3.924e+03, threshold=1.850e+03, percent-clipped=23.0 2023-06-27 12:56:28,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1811832.0, ans=0.0 2023-06-27 12:56:31,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1811832.0, ans=0.125 2023-06-27 12:56:32,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1811832.0, ans=0.125 2023-06-27 12:56:40,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1811892.0, ans=0.0 2023-06-27 12:57:09,554 INFO [train.py:996] (2/4) Epoch 10, batch 27550, loss[loss=0.219, simple_loss=0.2819, pruned_loss=0.07811, over 21285.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2973, pruned_loss=0.06986, over 4281070.58 frames. ], batch size: 471, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 12:57:10,660 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-06-27 12:57:13,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1812012.0, ans=0.125 2023-06-27 12:57:42,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=1812072.0, ans=0.02 2023-06-27 12:58:07,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1812132.0, ans=0.125 2023-06-27 12:58:12,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1812132.0, ans=0.025 2023-06-27 12:58:48,304 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=15.0 2023-06-27 12:58:48,793 INFO [train.py:996] (2/4) Epoch 10, batch 27600, loss[loss=0.1907, simple_loss=0.2626, pruned_loss=0.05936, over 21616.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2909, pruned_loss=0.06879, over 4257835.87 frames. ], batch size: 298, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 12:58:55,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1812312.0, ans=0.035 2023-06-27 12:58:56,896 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.403e+02 6.402e+02 9.119e+02 1.240e+03 2.150e+03, threshold=1.824e+03, percent-clipped=4.0 2023-06-27 13:00:03,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1812492.0, ans=0.125 2023-06-27 13:00:29,609 INFO [train.py:996] (2/4) Epoch 10, batch 27650, loss[loss=0.2001, simple_loss=0.2753, pruned_loss=0.06241, over 21848.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2856, pruned_loss=0.06795, over 4260980.89 frames. 
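Earlier in this stretch (Epoch 10, batch 27000) the log switches to "Computing validation loss", reports the validation loss over a fixed 1,796,401 frames, and prints the peak CUDA memory. A generic sketch of such a validation pass is below: iterate the dev loader under torch.no_grad(), accumulate a frame-weighted loss, and query torch.cuda.max_memory_allocated(). The model, dev_loader, and compute_loss helper are placeholders assumed for illustration.

```python
import torch

def run_validation(model, dev_loader, compute_loss) -> None:
    """Frame-weighted validation loss plus peak-memory report (placeholder helpers)."""
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in dev_loader:
            loss, num_frames = compute_loss(model, batch)  # assumed helper
            tot_loss += float(loss) * num_frames
            tot_frames += num_frames
    model.train()
    max_mem_mb = torch.cuda.max_memory_allocated() // (1024 * 1024)
    print(f"validation: loss={tot_loss / tot_frames:.4f}, over {tot_frames:.2f} frames.")
    print(f"Maximum memory allocated so far is {max_mem_mb}MB")
```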
], batch size: 98, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:01:02,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1812672.0, ans=0.0 2023-06-27 13:01:18,886 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.97 vs. limit=22.5 2023-06-27 13:01:20,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1812732.0, ans=0.125 2023-06-27 13:02:02,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1812852.0, ans=0.0 2023-06-27 13:02:04,378 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 13:02:10,592 INFO [train.py:996] (2/4) Epoch 10, batch 27700, loss[loss=0.2145, simple_loss=0.2966, pruned_loss=0.06621, over 21630.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2856, pruned_loss=0.0663, over 4257095.73 frames. ], batch size: 230, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:02:23,379 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.384e+02 6.863e+02 9.869e+02 1.519e+03 3.382e+03, threshold=1.974e+03, percent-clipped=13.0 2023-06-27 13:02:44,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1812972.0, ans=0.125 2023-06-27 13:03:18,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1813092.0, ans=0.0 2023-06-27 13:03:32,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1813152.0, ans=0.0 2023-06-27 13:03:44,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1813152.0, ans=0.1 2023-06-27 13:03:50,247 INFO [train.py:996] (2/4) Epoch 10, batch 27750, loss[loss=0.1508, simple_loss=0.2243, pruned_loss=0.0387, over 16558.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2879, pruned_loss=0.06568, over 4262460.00 frames. ], batch size: 60, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:05:16,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1813452.0, ans=0.0 2023-06-27 13:05:16,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1813452.0, ans=0.125 2023-06-27 13:05:28,656 INFO [train.py:996] (2/4) Epoch 10, batch 27800, loss[loss=0.1869, simple_loss=0.2521, pruned_loss=0.06081, over 21185.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2875, pruned_loss=0.06568, over 4270613.82 frames. ], batch size: 608, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:05:43,189 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.682e+02 6.752e+02 9.329e+02 1.344e+03 2.939e+03, threshold=1.866e+03, percent-clipped=10.0 2023-06-27 13:06:34,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.whiten.whitening_limit, batch_count=1813692.0, ans=15.0 2023-06-27 13:06:39,853 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.47 vs. 
limit=12.0 2023-06-27 13:07:09,263 INFO [train.py:996] (2/4) Epoch 10, batch 27850, loss[loss=0.2302, simple_loss=0.3074, pruned_loss=0.07647, over 21896.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2869, pruned_loss=0.06686, over 4274519.34 frames. ], batch size: 107, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:08:03,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1813932.0, ans=0.1 2023-06-27 13:08:12,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1813932.0, ans=0.125 2023-06-27 13:08:17,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1813992.0, ans=0.125 2023-06-27 13:08:39,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1814052.0, ans=0.125 2023-06-27 13:09:01,135 INFO [train.py:996] (2/4) Epoch 10, batch 27900, loss[loss=0.1808, simple_loss=0.2544, pruned_loss=0.05365, over 16850.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2935, pruned_loss=0.06732, over 4271409.39 frames. ], batch size: 61, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:09:15,843 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.509e+02 6.352e+02 8.865e+02 1.400e+03 2.806e+03, threshold=1.773e+03, percent-clipped=7.0 2023-06-27 13:09:26,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1814172.0, ans=0.1 2023-06-27 13:09:46,973 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.98 vs. limit=15.0 2023-06-27 13:09:55,195 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.40 vs. limit=15.0 2023-06-27 13:10:48,735 INFO [train.py:996] (2/4) Epoch 10, batch 27950, loss[loss=0.2063, simple_loss=0.2947, pruned_loss=0.05893, over 21748.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2935, pruned_loss=0.06473, over 4266972.31 frames. ], batch size: 298, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:11:05,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1814412.0, ans=0.2 2023-06-27 13:11:12,600 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.78 vs. limit=15.0 2023-06-27 13:12:27,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1814712.0, ans=0.125 2023-06-27 13:12:28,087 INFO [train.py:996] (2/4) Epoch 10, batch 28000, loss[loss=0.1896, simple_loss=0.2827, pruned_loss=0.04822, over 21038.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2926, pruned_loss=0.06306, over 4272182.06 frames. 
], batch size: 607, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:12:42,695 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.320e+02 5.982e+02 8.841e+02 1.274e+03 3.365e+03, threshold=1.768e+03, percent-clipped=7.0 2023-06-27 13:13:01,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1814772.0, ans=0.125 2023-06-27 13:13:08,816 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=12.0 2023-06-27 13:14:14,332 INFO [train.py:996] (2/4) Epoch 10, batch 28050, loss[loss=0.1887, simple_loss=0.2612, pruned_loss=0.05816, over 21760.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2901, pruned_loss=0.06444, over 4276789.92 frames. ], batch size: 247, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:14:36,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1815072.0, ans=0.125 2023-06-27 13:14:50,120 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=15.0 2023-06-27 13:15:44,093 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=12.0 2023-06-27 13:15:46,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1815252.0, ans=0.125 2023-06-27 13:15:54,367 INFO [train.py:996] (2/4) Epoch 10, batch 28100, loss[loss=0.1833, simple_loss=0.2526, pruned_loss=0.05696, over 21634.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2888, pruned_loss=0.065, over 4276077.96 frames. ], batch size: 282, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:15:59,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1815312.0, ans=0.0 2023-06-27 13:16:02,027 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.49 vs. limit=22.5 2023-06-27 13:16:06,171 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.041e+02 5.968e+02 9.165e+02 1.416e+03 2.614e+03, threshold=1.833e+03, percent-clipped=9.0 2023-06-27 13:16:16,841 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-27 13:16:53,614 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=15.0 2023-06-27 13:17:34,197 INFO [train.py:996] (2/4) Epoch 10, batch 28150, loss[loss=0.1698, simple_loss=0.2212, pruned_loss=0.05919, over 20725.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2813, pruned_loss=0.06497, over 4276417.12 frames. 
], batch size: 608, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:18:00,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1815672.0, ans=0.125 2023-06-27 13:18:37,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1815792.0, ans=0.0 2023-06-27 13:18:54,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1815792.0, ans=0.1 2023-06-27 13:19:14,724 INFO [train.py:996] (2/4) Epoch 10, batch 28200, loss[loss=0.2555, simple_loss=0.321, pruned_loss=0.09499, over 21665.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2801, pruned_loss=0.06633, over 4269778.20 frames. ], batch size: 441, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:19:21,176 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.96 vs. limit=22.5 2023-06-27 13:19:26,346 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.905e+02 6.047e+02 9.821e+02 1.464e+03 4.986e+03, threshold=1.964e+03, percent-clipped=9.0 2023-06-27 13:20:05,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1816032.0, ans=0.05 2023-06-27 13:20:09,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1816032.0, ans=0.125 2023-06-27 13:20:54,971 INFO [train.py:996] (2/4) Epoch 10, batch 28250, loss[loss=0.1954, simple_loss=0.2611, pruned_loss=0.06487, over 21572.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2829, pruned_loss=0.06866, over 4272434.19 frames. ], batch size: 263, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:20:57,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1816212.0, ans=0.0 2023-06-27 13:21:02,642 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.34 vs. limit=15.0 2023-06-27 13:21:34,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1816332.0, ans=0.04949747468305833 2023-06-27 13:22:17,079 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=22.5 2023-06-27 13:22:36,317 INFO [train.py:996] (2/4) Epoch 10, batch 28300, loss[loss=0.1899, simple_loss=0.2614, pruned_loss=0.05919, over 21861.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2813, pruned_loss=0.06679, over 4276567.80 frames. ], batch size: 107, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:22:47,918 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.032e+02 5.786e+02 9.744e+02 1.588e+03 3.149e+03, threshold=1.949e+03, percent-clipped=13.0 2023-06-27 13:23:18,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1816632.0, ans=0.2 2023-06-27 13:23:31,625 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.76 vs. limit=15.0 2023-06-27 13:24:15,597 INFO [train.py:996] (2/4) Epoch 10, batch 28350, loss[loss=0.1839, simple_loss=0.2557, pruned_loss=0.05601, over 21240.00 frames. 
], tot_loss[loss=0.201, simple_loss=0.2787, pruned_loss=0.06161, over 4280963.34 frames. ], batch size: 549, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:24:47,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1816872.0, ans=0.2 2023-06-27 13:25:33,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1816992.0, ans=0.125 2023-06-27 13:25:39,192 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=15.0 2023-06-27 13:25:55,978 INFO [train.py:996] (2/4) Epoch 10, batch 28400, loss[loss=0.2152, simple_loss=0.2788, pruned_loss=0.0758, over 21503.00 frames. ], tot_loss[loss=0.1999, simple_loss=0.2765, pruned_loss=0.06165, over 4271379.88 frames. ], batch size: 389, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:26:18,382 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.150e+02 6.326e+02 1.038e+03 1.651e+03 3.367e+03, threshold=2.075e+03, percent-clipped=16.0 2023-06-27 13:26:20,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1817172.0, ans=0.125 2023-06-27 13:26:35,072 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 13:27:07,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1817292.0, ans=0.0 2023-06-27 13:27:37,251 INFO [train.py:996] (2/4) Epoch 10, batch 28450, loss[loss=0.2191, simple_loss=0.2962, pruned_loss=0.07098, over 21753.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.284, pruned_loss=0.06574, over 4270554.55 frames. ], batch size: 298, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:27:38,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1817412.0, ans=0.125 2023-06-27 13:27:50,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1817412.0, ans=0.1 2023-06-27 13:27:58,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1817412.0, ans=0.125 2023-06-27 13:28:10,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1817472.0, ans=0.125 2023-06-27 13:28:23,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1817472.0, ans=0.0 2023-06-27 13:28:33,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1817532.0, ans=0.0 2023-06-27 13:28:51,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1817592.0, ans=0.125 2023-06-27 13:28:52,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1817592.0, ans=0.1 2023-06-27 13:29:04,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1817652.0, ans=0.125 2023-06-27 13:29:27,868 INFO [train.py:996] (2/4) Epoch 10, batch 28500, loss[loss=0.2034, simple_loss=0.277, pruned_loss=0.06488, over 20761.00 frames. 
], tot_loss[loss=0.2117, simple_loss=0.2866, pruned_loss=0.06845, over 4275774.38 frames. ], batch size: 607, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:29:28,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1817712.0, ans=0.125 2023-06-27 13:29:39,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1817712.0, ans=0.1 2023-06-27 13:29:50,080 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.54 vs. limit=15.0 2023-06-27 13:29:50,451 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.832e+02 6.822e+02 1.044e+03 1.325e+03 2.451e+03, threshold=2.088e+03, percent-clipped=2.0 2023-06-27 13:29:56,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1817772.0, ans=0.125 2023-06-27 13:30:06,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1817772.0, ans=0.125 2023-06-27 13:30:26,689 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=12.0 2023-06-27 13:30:53,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1817952.0, ans=0.05 2023-06-27 13:31:14,231 INFO [train.py:996] (2/4) Epoch 10, batch 28550, loss[loss=0.2366, simple_loss=0.3396, pruned_loss=0.0668, over 21903.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.294, pruned_loss=0.07031, over 4277683.35 frames. ], batch size: 317, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:31:50,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1818132.0, ans=0.0 2023-06-27 13:32:37,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1818252.0, ans=0.0 2023-06-27 13:32:59,169 INFO [train.py:996] (2/4) Epoch 10, batch 28600, loss[loss=0.2284, simple_loss=0.3071, pruned_loss=0.07485, over 21431.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3004, pruned_loss=0.07254, over 4280895.54 frames. ], batch size: 211, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:33:12,232 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.169e+02 6.322e+02 9.283e+02 1.275e+03 2.692e+03, threshold=1.857e+03, percent-clipped=3.0 2023-06-27 13:33:31,658 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.89 vs. limit=6.0 2023-06-27 13:33:47,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1818492.0, ans=0.2 2023-06-27 13:34:23,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1818552.0, ans=0.2 2023-06-27 13:34:40,151 INFO [train.py:996] (2/4) Epoch 10, batch 28650, loss[loss=0.1811, simple_loss=0.2445, pruned_loss=0.05888, over 21328.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2934, pruned_loss=0.07111, over 4284991.66 frames. 
], batch size: 211, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:35:22,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1818732.0, ans=0.125 2023-06-27 13:35:27,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1818732.0, ans=0.125 2023-06-27 13:35:36,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1818792.0, ans=0.125 2023-06-27 13:36:16,629 INFO [train.py:996] (2/4) Epoch 10, batch 28700, loss[loss=0.3008, simple_loss=0.3511, pruned_loss=0.1252, over 21448.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2937, pruned_loss=0.07267, over 4285325.91 frames. ], batch size: 507, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:36:19,474 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=15.0 2023-06-27 13:36:29,764 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.086e+02 6.900e+02 1.037e+03 1.524e+03 3.185e+03, threshold=2.075e+03, percent-clipped=14.0 2023-06-27 13:36:35,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1818972.0, ans=0.04949747468305833 2023-06-27 13:36:42,292 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.70 vs. limit=15.0 2023-06-27 13:36:43,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1818972.0, ans=0.0 2023-06-27 13:37:17,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1819092.0, ans=0.0 2023-06-27 13:37:22,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1819092.0, ans=0.125 2023-06-27 13:37:49,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1819152.0, ans=0.125 2023-06-27 13:37:57,658 INFO [train.py:996] (2/4) Epoch 10, batch 28750, loss[loss=0.2299, simple_loss=0.2972, pruned_loss=0.08129, over 21909.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2932, pruned_loss=0.07296, over 4288206.33 frames. ], batch size: 414, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:38:06,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1819212.0, ans=0.125 2023-06-27 13:38:11,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1819212.0, ans=0.1 2023-06-27 13:38:22,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1819272.0, ans=0.125 2023-06-27 13:39:33,335 INFO [train.py:996] (2/4) Epoch 10, batch 28800, loss[loss=0.1904, simple_loss=0.31, pruned_loss=0.03536, over 19869.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2974, pruned_loss=0.0733, over 4281684.36 frames. 
], batch size: 702, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:39:47,038 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.078e+02 7.759e+02 9.840e+02 1.249e+03 3.010e+03, threshold=1.968e+03, percent-clipped=7.0 2023-06-27 13:41:09,923 INFO [train.py:996] (2/4) Epoch 10, batch 28850, loss[loss=0.1961, simple_loss=0.2662, pruned_loss=0.063, over 21665.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2977, pruned_loss=0.07404, over 4284270.78 frames. ], batch size: 230, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:42:08,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1819932.0, ans=0.05 2023-06-27 13:42:21,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1819992.0, ans=0.0 2023-06-27 13:42:50,085 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.59 vs. limit=15.0 2023-06-27 13:42:50,437 INFO [train.py:996] (2/4) Epoch 10, batch 28900, loss[loss=0.2357, simple_loss=0.2987, pruned_loss=0.08637, over 21290.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3006, pruned_loss=0.07554, over 4284121.62 frames. ], batch size: 143, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:43:05,416 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.892e+02 6.958e+02 1.036e+03 1.416e+03 3.093e+03, threshold=2.073e+03, percent-clipped=9.0 2023-06-27 13:43:07,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1820172.0, ans=0.0 2023-06-27 13:43:09,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1820172.0, ans=0.0 2023-06-27 13:43:52,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1820232.0, ans=0.125 2023-06-27 13:43:58,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1820292.0, ans=0.2 2023-06-27 13:44:17,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1820352.0, ans=0.035 2023-06-27 13:44:33,633 INFO [train.py:996] (2/4) Epoch 10, batch 28950, loss[loss=0.2418, simple_loss=0.3515, pruned_loss=0.06601, over 21210.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3035, pruned_loss=0.07529, over 4276197.28 frames. ], batch size: 548, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:44:41,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1820412.0, ans=0.125 2023-06-27 13:44:55,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1820412.0, ans=0.0 2023-06-27 13:45:20,754 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.14 vs. limit=15.0 2023-06-27 13:45:51,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1820592.0, ans=0.0 2023-06-27 13:46:15,102 INFO [train.py:996] (2/4) Epoch 10, batch 29000, loss[loss=0.2683, simple_loss=0.3382, pruned_loss=0.0992, over 21769.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3061, pruned_loss=0.07502, over 4269384.95 frames. 
], batch size: 441, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:46:37,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1820712.0, ans=0.04949747468305833 2023-06-27 13:46:43,662 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.952e+02 6.978e+02 9.216e+02 1.338e+03 4.286e+03, threshold=1.843e+03, percent-clipped=9.0 2023-06-27 13:46:49,578 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.66 vs. limit=15.0 2023-06-27 13:46:54,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1820772.0, ans=0.125 2023-06-27 13:47:19,836 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.33 vs. limit=15.0 2023-06-27 13:47:26,331 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.66 vs. limit=15.0 2023-06-27 13:47:46,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1820952.0, ans=0.125 2023-06-27 13:48:04,739 INFO [train.py:996] (2/4) Epoch 10, batch 29050, loss[loss=0.224, simple_loss=0.2927, pruned_loss=0.07763, over 21808.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3048, pruned_loss=0.07547, over 4276531.91 frames. ], batch size: 112, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:48:23,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1821012.0, ans=0.5 2023-06-27 13:48:32,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1821072.0, ans=0.5 2023-06-27 13:49:22,518 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.95 vs. limit=10.0 2023-06-27 13:49:40,670 INFO [train.py:996] (2/4) Epoch 10, batch 29100, loss[loss=0.179, simple_loss=0.2448, pruned_loss=0.05654, over 21569.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2969, pruned_loss=0.07345, over 4280666.12 frames. 
], batch size: 231, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:49:42,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1821312.0, ans=0.05 2023-06-27 13:49:47,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1821312.0, ans=0.125 2023-06-27 13:49:51,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1821312.0, ans=0.125 2023-06-27 13:49:55,604 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.400e+02 6.043e+02 9.332e+02 1.585e+03 3.722e+03, threshold=1.866e+03, percent-clipped=13.0 2023-06-27 13:50:07,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1821372.0, ans=0.125 2023-06-27 13:51:15,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1821612.0, ans=0.1 2023-06-27 13:51:16,640 INFO [train.py:996] (2/4) Epoch 10, batch 29150, loss[loss=0.2178, simple_loss=0.3105, pruned_loss=0.06256, over 21378.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2944, pruned_loss=0.07151, over 4268140.63 frames. ], batch size: 176, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:51:21,376 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.17 vs. limit=10.0 2023-06-27 13:51:32,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1821672.0, ans=0.0 2023-06-27 13:51:35,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1821672.0, ans=0.0 2023-06-27 13:51:55,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1821732.0, ans=0.0 2023-06-27 13:52:03,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1821732.0, ans=0.125 2023-06-27 13:52:07,427 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=22.5 2023-06-27 13:52:57,674 INFO [train.py:996] (2/4) Epoch 10, batch 29200, loss[loss=0.2092, simple_loss=0.2959, pruned_loss=0.06122, over 21731.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2898, pruned_loss=0.07038, over 4263967.73 frames. ], batch size: 351, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:53:13,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1821972.0, ans=0.2 2023-06-27 13:53:14,036 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.182e+02 6.160e+02 1.002e+03 1.749e+03 3.498e+03, threshold=2.004e+03, percent-clipped=20.0 2023-06-27 13:53:58,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1822092.0, ans=0.125 2023-06-27 13:54:29,616 INFO [train.py:996] (2/4) Epoch 10, batch 29250, loss[loss=0.1648, simple_loss=0.2478, pruned_loss=0.04093, over 21778.00 frames. ], tot_loss[loss=0.213, simple_loss=0.289, pruned_loss=0.06847, over 4268177.72 frames. 
], batch size: 118, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:54:30,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1822212.0, ans=0.0 2023-06-27 13:55:50,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1822452.0, ans=0.125 2023-06-27 13:55:53,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1822452.0, ans=0.0 2023-06-27 13:56:06,669 INFO [train.py:996] (2/4) Epoch 10, batch 29300, loss[loss=0.1872, simple_loss=0.2602, pruned_loss=0.05709, over 21212.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2899, pruned_loss=0.0679, over 4266317.41 frames. ], batch size: 159, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:56:08,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1822512.0, ans=0.125 2023-06-27 13:56:22,985 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.212e+02 5.568e+02 7.846e+02 1.257e+03 2.359e+03, threshold=1.569e+03, percent-clipped=3.0 2023-06-27 13:56:48,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1822632.0, ans=0.95 2023-06-27 13:56:53,981 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-06-27 13:57:31,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1822752.0, ans=0.2 2023-06-27 13:57:31,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1822752.0, ans=0.0 2023-06-27 13:57:48,084 INFO [train.py:996] (2/4) Epoch 10, batch 29350, loss[loss=0.2145, simple_loss=0.2827, pruned_loss=0.07314, over 15276.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2869, pruned_loss=0.06751, over 4263854.15 frames. ], batch size: 61, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:57:52,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1822812.0, ans=0.1 2023-06-27 13:57:59,357 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0 2023-06-27 13:58:07,864 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-27 13:59:16,822 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=15.0 2023-06-27 13:59:31,050 INFO [train.py:996] (2/4) Epoch 10, batch 29400, loss[loss=0.2022, simple_loss=0.3062, pruned_loss=0.0491, over 20796.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2867, pruned_loss=0.06552, over 4265033.43 frames. 
], batch size: 609, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:59:47,674 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.061e+02 6.893e+02 1.012e+03 1.543e+03 3.903e+03, threshold=2.024e+03, percent-clipped=23.0 2023-06-27 13:59:51,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1823172.0, ans=0.0 2023-06-27 14:00:49,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1823292.0, ans=0.125 2023-06-27 14:01:12,883 INFO [train.py:996] (2/4) Epoch 10, batch 29450, loss[loss=0.2035, simple_loss=0.2795, pruned_loss=0.06377, over 21493.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2856, pruned_loss=0.06486, over 4257950.28 frames. ], batch size: 194, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 14:01:21,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1823412.0, ans=0.125 2023-06-27 14:02:45,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1823652.0, ans=0.125 2023-06-27 14:02:51,895 INFO [train.py:996] (2/4) Epoch 10, batch 29500, loss[loss=0.2031, simple_loss=0.2734, pruned_loss=0.06637, over 21345.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2894, pruned_loss=0.0681, over 4269148.67 frames. ], batch size: 159, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 14:02:58,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1823712.0, ans=0.0 2023-06-27 14:03:07,703 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.631e+02 6.865e+02 1.061e+03 1.645e+03 3.419e+03, threshold=2.123e+03, percent-clipped=12.0 2023-06-27 14:03:09,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1823772.0, ans=0.125 2023-06-27 14:04:15,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1823952.0, ans=0.07 2023-06-27 14:04:33,493 INFO [train.py:996] (2/4) Epoch 10, batch 29550, loss[loss=0.2077, simple_loss=0.2744, pruned_loss=0.0705, over 21429.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2885, pruned_loss=0.0696, over 4281427.16 frames. ], batch size: 211, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 14:04:54,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1824072.0, ans=0.125 2023-06-27 14:04:54,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1824072.0, ans=0.125 2023-06-27 14:04:54,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1824072.0, ans=0.0 2023-06-27 14:05:01,683 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.72 vs. limit=15.0 2023-06-27 14:06:03,515 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=15.0 2023-06-27 14:06:11,158 INFO [train.py:996] (2/4) Epoch 10, batch 29600, loss[loss=0.2842, simple_loss=0.4109, pruned_loss=0.07874, over 19863.00 frames. 
], tot_loss[loss=0.2202, simple_loss=0.2965, pruned_loss=0.07201, over 4286787.38 frames. ], batch size: 702, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 14:06:25,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1824312.0, ans=0.0 2023-06-27 14:06:29,598 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.301e+02 5.949e+02 7.426e+02 9.960e+02 2.480e+03, threshold=1.485e+03, percent-clipped=1.0 2023-06-27 14:06:46,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1824372.0, ans=0.0 2023-06-27 14:07:14,740 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=15.0 2023-06-27 14:07:35,929 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=22.5 2023-06-27 14:07:42,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1824612.0, ans=0.2 2023-06-27 14:07:43,061 INFO [train.py:996] (2/4) Epoch 10, batch 29650, loss[loss=0.181, simple_loss=0.2562, pruned_loss=0.05291, over 21363.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2947, pruned_loss=0.06866, over 4263561.38 frames. ], batch size: 194, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:09:02,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1824852.0, ans=0.0 2023-06-27 14:09:20,374 INFO [train.py:996] (2/4) Epoch 10, batch 29700, loss[loss=0.2038, simple_loss=0.2762, pruned_loss=0.06567, over 16106.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2938, pruned_loss=0.06839, over 4269056.92 frames. ], batch size: 60, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:09:24,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1824912.0, ans=0.0 2023-06-27 14:09:31,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1824912.0, ans=0.0 2023-06-27 14:09:41,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1824972.0, ans=0.2 2023-06-27 14:09:42,884 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.057e+02 7.297e+02 1.065e+03 1.869e+03 3.621e+03, threshold=2.131e+03, percent-clipped=32.0 2023-06-27 14:10:27,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1825092.0, ans=0.2 2023-06-27 14:10:31,260 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-27 14:10:32,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1825092.0, ans=0.1 2023-06-27 14:10:50,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1825152.0, ans=0.125 2023-06-27 14:10:53,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1825152.0, ans=0.0 2023-06-27 14:10:56,007 INFO [train.py:996] (2/4) Epoch 10, batch 29750, loss[loss=0.2396, simple_loss=0.3183, pruned_loss=0.08048, over 21988.00 frames. 
], tot_loss[loss=0.218, simple_loss=0.2991, pruned_loss=0.06843, over 4276642.49 frames. ], batch size: 113, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:11:10,136 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=15.0 2023-06-27 14:11:13,221 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=15.0 2023-06-27 14:11:25,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1825272.0, ans=0.125 2023-06-27 14:11:53,610 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.31 vs. limit=15.0 2023-06-27 14:12:27,397 INFO [train.py:996] (2/4) Epoch 10, batch 29800, loss[loss=0.1971, simple_loss=0.2779, pruned_loss=0.05812, over 21485.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.3005, pruned_loss=0.06955, over 4276999.82 frames. ], batch size: 194, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:12:51,200 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.39 vs. limit=10.0 2023-06-27 14:12:51,511 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.238e+02 6.246e+02 9.031e+02 1.363e+03 2.753e+03, threshold=1.806e+03, percent-clipped=5.0 2023-06-27 14:13:52,276 INFO [train.py:996] (2/4) Epoch 10, batch 29850, loss[loss=0.2093, simple_loss=0.2877, pruned_loss=0.06541, over 21839.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2963, pruned_loss=0.06698, over 4280011.27 frames. ], batch size: 391, lr: 2.86e-03, grad_scale: 8.0 2023-06-27 14:13:56,668 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=22.5 2023-06-27 14:14:03,087 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.51 vs. limit=15.0 2023-06-27 14:14:07,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1825872.0, ans=0.1 2023-06-27 14:14:50,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1825932.0, ans=0.125 2023-06-27 14:15:04,218 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.32 vs. limit=15.0 2023-06-27 14:15:27,918 INFO [train.py:996] (2/4) Epoch 10, batch 29900, loss[loss=0.2216, simple_loss=0.2918, pruned_loss=0.07571, over 21760.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2941, pruned_loss=0.06757, over 4289071.81 frames. 
], batch size: 414, lr: 2.86e-03, grad_scale: 8.0 2023-06-27 14:16:07,293 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.080e+02 5.707e+02 7.601e+02 1.173e+03 3.198e+03, threshold=1.520e+03, percent-clipped=6.0 2023-06-27 14:16:14,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1826172.0, ans=0.2 2023-06-27 14:16:33,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1826232.0, ans=0.125 2023-06-27 14:16:48,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1826292.0, ans=0.125 2023-06-27 14:17:11,028 INFO [train.py:996] (2/4) Epoch 10, batch 29950, loss[loss=0.2521, simple_loss=0.3368, pruned_loss=0.08366, over 21784.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2972, pruned_loss=0.07094, over 4290118.43 frames. ], batch size: 124, lr: 2.86e-03, grad_scale: 8.0 2023-06-27 14:17:32,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1826412.0, ans=0.1 2023-06-27 14:17:40,243 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.97 vs. limit=15.0 2023-06-27 14:17:43,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1826472.0, ans=0.0 2023-06-27 14:18:03,385 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-27 14:18:04,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1826532.0, ans=0.0 2023-06-27 14:18:21,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1826592.0, ans=0.0 2023-06-27 14:19:05,000 INFO [train.py:996] (2/4) Epoch 10, batch 30000, loss[loss=0.2326, simple_loss=0.331, pruned_loss=0.06708, over 21598.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2997, pruned_loss=0.07051, over 4285318.88 frames. ], batch size: 441, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:19:05,001 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-27 14:19:19,072 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.4.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.8010, 2.5423, 3.9934, 2.6271], device='cuda:2') 2023-06-27 14:19:22,078 INFO [train.py:1028] (2/4) Epoch 10, validation: loss=0.2475, simple_loss=0.3412, pruned_loss=0.07692, over 1796401.00 frames. 2023-06-27 14:19:22,078 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-27 14:19:43,676 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.141e+02 6.862e+02 9.553e+02 1.677e+03 3.481e+03, threshold=1.911e+03, percent-clipped=29.0 2023-06-27 14:19:50,260 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.12 vs. limit=10.0 2023-06-27 14:20:45,671 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.27 vs. 
limit=22.5 2023-06-27 14:20:48,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1826952.0, ans=0.1 2023-06-27 14:20:52,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1826952.0, ans=0.1 2023-06-27 14:21:02,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1826952.0, ans=0.1 2023-06-27 14:21:04,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1827012.0, ans=0.125 2023-06-27 14:21:05,364 INFO [train.py:996] (2/4) Epoch 10, batch 30050, loss[loss=0.1883, simple_loss=0.2605, pruned_loss=0.05809, over 19948.00 frames. ], tot_loss[loss=0.219, simple_loss=0.3019, pruned_loss=0.06809, over 4274528.52 frames. ], batch size: 702, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:21:24,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1827072.0, ans=0.1 2023-06-27 14:22:16,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1827192.0, ans=0.5 2023-06-27 14:22:21,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1827252.0, ans=0.0 2023-06-27 14:22:39,168 INFO [train.py:996] (2/4) Epoch 10, batch 30100, loss[loss=0.1954, simple_loss=0.2663, pruned_loss=0.06219, over 21785.00 frames. ], tot_loss[loss=0.218, simple_loss=0.3008, pruned_loss=0.06759, over 4266008.36 frames. ], batch size: 102, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:22:58,227 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.865e+02 7.541e+02 1.187e+03 1.645e+03 3.691e+03, threshold=2.374e+03, percent-clipped=12.0 2023-06-27 14:23:02,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1827372.0, ans=0.0 2023-06-27 14:23:48,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1827492.0, ans=0.2 2023-06-27 14:23:52,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1827492.0, ans=0.1 2023-06-27 14:24:16,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1827612.0, ans=0.1 2023-06-27 14:24:17,471 INFO [train.py:996] (2/4) Epoch 10, batch 30150, loss[loss=0.228, simple_loss=0.3023, pruned_loss=0.07683, over 21425.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2966, pruned_loss=0.06873, over 4259240.22 frames. ], batch size: 159, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:24:18,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1827612.0, ans=0.125 2023-06-27 14:24:28,677 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.06 vs. limit=15.0 2023-06-27 14:25:10,893 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.24 vs. 
limit=15.0 2023-06-27 14:25:48,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1827852.0, ans=0.125 2023-06-27 14:25:50,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1827852.0, ans=0.125 2023-06-27 14:26:02,898 INFO [train.py:996] (2/4) Epoch 10, batch 30200, loss[loss=0.2067, simple_loss=0.3053, pruned_loss=0.05398, over 21814.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2994, pruned_loss=0.06791, over 4265260.54 frames. ], batch size: 282, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:26:03,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1827912.0, ans=0.2 2023-06-27 14:26:42,655 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.354e+02 6.809e+02 8.710e+02 1.204e+03 2.614e+03, threshold=1.742e+03, percent-clipped=2.0 2023-06-27 14:27:30,737 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.08 vs. limit=15.0 2023-06-27 14:27:49,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1828152.0, ans=0.2 2023-06-27 14:28:02,217 INFO [train.py:996] (2/4) Epoch 10, batch 30250, loss[loss=0.2619, simple_loss=0.3743, pruned_loss=0.07478, over 21790.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3071, pruned_loss=0.07, over 4265655.21 frames. ], batch size: 332, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:28:02,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1828212.0, ans=0.125 2023-06-27 14:28:17,597 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=15.0 2023-06-27 14:28:30,744 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.80 vs. limit=10.0 2023-06-27 14:28:35,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1828272.0, ans=0.1 2023-06-27 14:28:35,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1828272.0, ans=0.125 2023-06-27 14:28:38,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1828332.0, ans=0.1 2023-06-27 14:28:55,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1828392.0, ans=0.125 2023-06-27 14:29:38,360 INFO [train.py:996] (2/4) Epoch 10, batch 30300, loss[loss=0.1708, simple_loss=0.243, pruned_loss=0.04929, over 21283.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.304, pruned_loss=0.06983, over 4266836.60 frames. ], batch size: 176, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:29:45,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1828512.0, ans=0.5 2023-06-27 14:29:47,753 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.27 vs. 
limit=10.0 2023-06-27 14:30:03,866 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.241e+02 6.596e+02 9.409e+02 1.315e+03 2.834e+03, threshold=1.882e+03, percent-clipped=10.0 2023-06-27 14:30:05,250 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.93 vs. limit=15.0 2023-06-27 14:31:27,452 INFO [train.py:996] (2/4) Epoch 10, batch 30350, loss[loss=0.1747, simple_loss=0.2366, pruned_loss=0.05634, over 21181.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3056, pruned_loss=0.07132, over 4269246.64 frames. ], batch size: 159, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:31:30,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1828812.0, ans=15.0 2023-06-27 14:31:39,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1828812.0, ans=0.1 2023-06-27 14:32:03,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1828932.0, ans=0.125 2023-06-27 14:32:03,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1828932.0, ans=0.125 2023-06-27 14:32:08,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1828932.0, ans=0.0 2023-06-27 14:32:41,788 INFO [train.py:996] (2/4) Epoch 10, batch 30400, loss[loss=0.2019, simple_loss=0.2553, pruned_loss=0.07428, over 20217.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.3003, pruned_loss=0.06991, over 4260103.17 frames. ], batch size: 703, lr: 2.86e-03, grad_scale: 32.0 2023-06-27 14:33:08,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1829172.0, ans=0.2 2023-06-27 14:33:09,681 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.426e+02 7.954e+02 1.288e+03 1.926e+03 4.132e+03, threshold=2.577e+03, percent-clipped=26.0 2023-06-27 14:33:14,478 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 14:33:52,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1829352.0, ans=0.05 2023-06-27 14:34:08,239 INFO [train.py:996] (2/4) Epoch 10, batch 30450, loss[loss=0.2581, simple_loss=0.3807, pruned_loss=0.06777, over 19852.00 frames. ], tot_loss[loss=0.22, simple_loss=0.3006, pruned_loss=0.06971, over 4201070.70 frames. ], batch size: 702, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:34:26,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1829472.0, ans=0.125 2023-06-27 14:34:30,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1829472.0, ans=0.2 2023-06-27 14:34:50,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1829532.0, ans=0.1 2023-06-27 14:34:55,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1829532.0, ans=0.125 2023-06-27 14:37:28,504 INFO [train.py:996] (2/4) Epoch 11, batch 0, loss[loss=0.1992, simple_loss=0.2711, pruned_loss=0.0636, over 21680.00 frames. 
], tot_loss[loss=0.1992, simple_loss=0.2711, pruned_loss=0.0636, over 21680.00 frames. ], batch size: 282, lr: 2.72e-03, grad_scale: 32.0 2023-06-27 14:37:28,505 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-27 14:37:44,730 INFO [train.py:1028] (2/4) Epoch 11, validation: loss=0.2445, simple_loss=0.3464, pruned_loss=0.07127, over 1796401.00 frames. 2023-06-27 14:37:44,731 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-27 14:37:54,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1829676.0, ans=0.0 2023-06-27 14:38:23,072 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.704e+02 1.606e+03 2.605e+03 4.493e+03 1.142e+04, threshold=5.209e+03, percent-clipped=50.0 2023-06-27 14:38:46,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1829796.0, ans=0.125 2023-06-27 14:38:59,699 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=22.5 2023-06-27 14:39:20,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1829916.0, ans=0.2 2023-06-27 14:39:22,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1829916.0, ans=15.0 2023-06-27 14:39:26,627 INFO [train.py:996] (2/4) Epoch 11, batch 50, loss[loss=0.2034, simple_loss=0.2739, pruned_loss=0.06641, over 21834.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3127, pruned_loss=0.07317, over 971898.56 frames. ], batch size: 98, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:40:12,102 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 14:40:13,567 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 14:40:48,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1830216.0, ans=0.125 2023-06-27 14:40:54,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1830216.0, ans=0.125 2023-06-27 14:41:08,873 INFO [train.py:996] (2/4) Epoch 11, batch 100, loss[loss=0.1816, simple_loss=0.2522, pruned_loss=0.05552, over 21902.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3215, pruned_loss=0.07256, over 1702062.79 frames. ], batch size: 98, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:41:46,184 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.133e+02 5.871e+02 7.705e+02 1.160e+03 1.899e+03, threshold=1.541e+03, percent-clipped=0.0 2023-06-27 14:41:48,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1830396.0, ans=0.05 2023-06-27 14:41:48,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1830396.0, ans=0.1 2023-06-27 14:42:37,460 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.52 vs. limit=22.5 2023-06-27 14:42:51,596 INFO [train.py:996] (2/4) Epoch 11, batch 150, loss[loss=0.2328, simple_loss=0.3279, pruned_loss=0.06883, over 21768.00 frames. 
], tot_loss[loss=0.2326, simple_loss=0.3199, pruned_loss=0.07268, over 2267065.27 frames. ], batch size: 298, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:43:07,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1830636.0, ans=0.125 2023-06-27 14:43:13,747 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 14:44:33,956 INFO [train.py:996] (2/4) Epoch 11, batch 200, loss[loss=0.2391, simple_loss=0.3048, pruned_loss=0.08673, over 21852.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3183, pruned_loss=0.07136, over 2704257.81 frames. ], batch size: 441, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:44:35,150 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-27 14:44:36,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1830876.0, ans=0.125 2023-06-27 14:45:11,943 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.129e+02 7.270e+02 1.005e+03 1.466e+03 4.683e+03, threshold=2.010e+03, percent-clipped=22.0 2023-06-27 14:46:12,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1831116.0, ans=0.0 2023-06-27 14:46:18,451 INFO [train.py:996] (2/4) Epoch 11, batch 250, loss[loss=0.2194, simple_loss=0.3199, pruned_loss=0.05943, over 21839.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3123, pruned_loss=0.07004, over 3059008.50 frames. ], batch size: 371, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:46:19,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1831176.0, ans=0.05 2023-06-27 14:46:37,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1831236.0, ans=15.0 2023-06-27 14:46:40,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1831236.0, ans=0.0 2023-06-27 14:46:41,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1831236.0, ans=0.125 2023-06-27 14:48:01,914 INFO [train.py:996] (2/4) Epoch 11, batch 300, loss[loss=0.1841, simple_loss=0.2552, pruned_loss=0.05652, over 21586.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3042, pruned_loss=0.06864, over 3326969.65 frames. ], batch size: 263, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:48:11,597 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.51 vs. limit=15.0 2023-06-27 14:48:40,913 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.322e+02 6.333e+02 9.156e+02 1.285e+03 2.394e+03, threshold=1.831e+03, percent-clipped=6.0 2023-06-27 14:48:49,191 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.66 vs. 
limit=15.0 2023-06-27 14:48:50,457 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 14:49:01,422 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.93 vs. limit=6.0 2023-06-27 14:49:47,313 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=22.5 2023-06-27 14:49:47,627 INFO [train.py:996] (2/4) Epoch 11, batch 350, loss[loss=0.1894, simple_loss=0.2834, pruned_loss=0.04769, over 21297.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2956, pruned_loss=0.06775, over 3538979.16 frames. ], batch size: 176, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:50:01,048 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.62 vs. limit=15.0 2023-06-27 14:50:05,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1831836.0, ans=0.0 2023-06-27 14:50:07,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1831836.0, ans=0.125 2023-06-27 14:50:11,269 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.88 vs. limit=15.0 2023-06-27 14:50:14,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1831836.0, ans=0.0 2023-06-27 14:50:15,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1831836.0, ans=0.09899494936611666 2023-06-27 14:50:19,601 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.41 vs. limit=15.0 2023-06-27 14:50:25,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1831896.0, ans=0.125 2023-06-27 14:50:34,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1831896.0, ans=0.125 2023-06-27 14:51:29,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1832076.0, ans=0.125 2023-06-27 14:51:29,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1832076.0, ans=0.0 2023-06-27 14:51:30,054 INFO [train.py:996] (2/4) Epoch 11, batch 400, loss[loss=0.1788, simple_loss=0.244, pruned_loss=0.05678, over 21282.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2926, pruned_loss=0.06669, over 3694261.55 frames. ], batch size: 144, lr: 2.72e-03, grad_scale: 32.0 2023-06-27 14:51:35,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1832076.0, ans=0.0 2023-06-27 14:52:09,628 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.211e+02 7.477e+02 1.167e+03 1.835e+03 4.227e+03, threshold=2.334e+03, percent-clipped=25.0 2023-06-27 14:53:12,790 INFO [train.py:996] (2/4) Epoch 11, batch 450, loss[loss=0.2154, simple_loss=0.2701, pruned_loss=0.08032, over 21308.00 frames. 
], tot_loss[loss=0.2097, simple_loss=0.2879, pruned_loss=0.06575, over 3821213.00 frames. ], batch size: 473, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:53:49,337 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.39 vs. limit=15.0 2023-06-27 14:54:48,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1832616.0, ans=0.125 2023-06-27 14:54:57,313 INFO [train.py:996] (2/4) Epoch 11, batch 500, loss[loss=0.2435, simple_loss=0.339, pruned_loss=0.07401, over 21788.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2888, pruned_loss=0.06515, over 3929482.25 frames. ], batch size: 282, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:54:59,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1832676.0, ans=0.0 2023-06-27 14:55:09,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1832676.0, ans=0.0 2023-06-27 14:55:13,241 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 14:55:31,481 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.58 vs. limit=6.0 2023-06-27 14:55:37,224 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.096e+02 9.470e+02 1.676e+03 2.580e+03 4.364e+03, threshold=3.351e+03, percent-clipped=30.0 2023-06-27 14:55:50,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1832796.0, ans=0.125 2023-06-27 14:56:03,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1832856.0, ans=0.125 2023-06-27 14:56:24,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1832916.0, ans=0.125 2023-06-27 14:56:39,109 INFO [train.py:996] (2/4) Epoch 11, batch 550, loss[loss=0.2511, simple_loss=0.3796, pruned_loss=0.06128, over 20712.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2928, pruned_loss=0.06387, over 4000359.45 frames. ], batch size: 607, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:56:47,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1832976.0, ans=0.125 2023-06-27 14:57:47,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1833156.0, ans=0.2 2023-06-27 14:57:57,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=1833156.0, ans=0.1 2023-06-27 14:58:22,120 INFO [train.py:996] (2/4) Epoch 11, batch 600, loss[loss=0.211, simple_loss=0.2757, pruned_loss=0.07315, over 21849.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.299, pruned_loss=0.06484, over 4065831.07 frames. ], batch size: 98, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 14:59:00,930 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.101e+02 6.551e+02 9.996e+02 1.452e+03 3.285e+03, threshold=1.999e+03, percent-clipped=0.0 2023-06-27 15:00:03,728 INFO [train.py:996] (2/4) Epoch 11, batch 650, loss[loss=0.2147, simple_loss=0.3411, pruned_loss=0.04411, over 20735.00 frames. 
], tot_loss[loss=0.2148, simple_loss=0.3001, pruned_loss=0.0647, over 4111855.66 frames. ], batch size: 608, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:00:11,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1833576.0, ans=0.0 2023-06-27 15:01:01,591 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 15:01:26,332 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.97 vs. limit=6.0 2023-06-27 15:01:30,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1833816.0, ans=0.2 2023-06-27 15:01:37,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1833816.0, ans=0.0 2023-06-27 15:01:39,987 INFO [train.py:996] (2/4) Epoch 11, batch 700, loss[loss=0.2078, simple_loss=0.2807, pruned_loss=0.06746, over 21786.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2989, pruned_loss=0.06543, over 4157794.58 frames. ], batch size: 332, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:01:59,050 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=15.0 2023-06-27 15:02:26,189 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.643e+02 7.317e+02 1.195e+03 1.924e+03 5.182e+03, threshold=2.390e+03, percent-clipped=22.0 2023-06-27 15:02:35,792 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.32 vs. limit=12.0 2023-06-27 15:02:36,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1833996.0, ans=0.2 2023-06-27 15:03:02,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1834056.0, ans=0.0 2023-06-27 15:03:07,542 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 15:03:25,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1834176.0, ans=0.2 2023-06-27 15:03:26,530 INFO [train.py:996] (2/4) Epoch 11, batch 750, loss[loss=0.2083, simple_loss=0.2759, pruned_loss=0.07041, over 21750.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2957, pruned_loss=0.06631, over 4192799.58 frames. ], batch size: 316, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:03:28,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1834176.0, ans=0.015 2023-06-27 15:03:32,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1834176.0, ans=0.1 2023-06-27 15:03:43,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1834236.0, ans=0.125 2023-06-27 15:04:21,637 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=22.5 2023-06-27 15:05:09,840 INFO [train.py:996] (2/4) Epoch 11, batch 800, loss[loss=0.1848, simple_loss=0.2631, pruned_loss=0.05323, over 21726.00 frames. 
], tot_loss[loss=0.214, simple_loss=0.2942, pruned_loss=0.06691, over 4196718.57 frames. ], batch size: 124, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:05:48,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1834596.0, ans=0.125 2023-06-27 15:05:51,246 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.162e+02 6.690e+02 1.036e+03 1.625e+03 3.290e+03, threshold=2.071e+03, percent-clipped=5.0 2023-06-27 15:06:30,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1834716.0, ans=0.0 2023-06-27 15:06:53,168 INFO [train.py:996] (2/4) Epoch 11, batch 850, loss[loss=0.2101, simple_loss=0.286, pruned_loss=0.06711, over 21466.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2921, pruned_loss=0.06743, over 4224091.20 frames. ], batch size: 131, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:07:08,598 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-06-27 15:07:22,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1834836.0, ans=0.1 2023-06-27 15:08:32,833 INFO [train.py:996] (2/4) Epoch 11, batch 900, loss[loss=0.2008, simple_loss=0.2971, pruned_loss=0.05223, over 21839.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.288, pruned_loss=0.06662, over 4244867.01 frames. ], batch size: 371, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:09:02,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1835136.0, ans=0.1 2023-06-27 15:09:15,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1835196.0, ans=0.125 2023-06-27 15:09:18,553 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.069e+02 6.963e+02 1.051e+03 1.568e+03 3.283e+03, threshold=2.103e+03, percent-clipped=8.0 2023-06-27 15:10:10,476 INFO [train.py:996] (2/4) Epoch 11, batch 950, loss[loss=0.232, simple_loss=0.365, pruned_loss=0.04952, over 19734.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2858, pruned_loss=0.0658, over 4250466.28 frames. ], batch size: 702, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:10:57,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1835496.0, ans=0.0 2023-06-27 15:11:02,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1835496.0, ans=0.125 2023-06-27 15:11:53,127 INFO [train.py:996] (2/4) Epoch 11, batch 1000, loss[loss=0.2389, simple_loss=0.3158, pruned_loss=0.08099, over 21818.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2859, pruned_loss=0.06583, over 4262035.53 frames. ], batch size: 118, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:11:58,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1835676.0, ans=0.0 2023-06-27 15:12:44,392 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.633e+02 7.396e+02 1.258e+03 1.842e+03 3.420e+03, threshold=2.515e+03, percent-clipped=20.0 2023-06-27 15:12:52,649 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.07 vs. 
limit=12.0 2023-06-27 15:13:22,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1835916.0, ans=0.0 2023-06-27 15:13:36,706 INFO [train.py:996] (2/4) Epoch 11, batch 1050, loss[loss=0.2373, simple_loss=0.3143, pruned_loss=0.08011, over 21321.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.286, pruned_loss=0.06562, over 4267831.10 frames. ], batch size: 143, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:13:45,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1835976.0, ans=0.1 2023-06-27 15:14:22,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1836036.0, ans=0.1 2023-06-27 15:14:26,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1836096.0, ans=0.125 2023-06-27 15:14:34,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1836096.0, ans=0.125 2023-06-27 15:14:59,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1836156.0, ans=0.125 2023-06-27 15:15:03,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1836216.0, ans=0.2 2023-06-27 15:15:26,372 INFO [train.py:996] (2/4) Epoch 11, batch 1100, loss[loss=0.2114, simple_loss=0.2826, pruned_loss=0.07013, over 21890.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2868, pruned_loss=0.06547, over 4271694.40 frames. ], batch size: 351, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:15:46,580 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.89 vs. limit=22.5 2023-06-27 15:16:13,024 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.192e+02 8.562e+02 1.240e+03 1.886e+03 2.880e+03, threshold=2.480e+03, percent-clipped=5.0 2023-06-27 15:16:30,030 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.38 vs. limit=15.0 2023-06-27 15:17:09,916 INFO [train.py:996] (2/4) Epoch 11, batch 1150, loss[loss=0.2259, simple_loss=0.3029, pruned_loss=0.0744, over 21315.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2891, pruned_loss=0.06551, over 4285963.22 frames. ], batch size: 194, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:17:27,401 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.81 vs. 
limit=22.5 2023-06-27 15:18:05,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1836696.0, ans=0.125 2023-06-27 15:18:06,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1836696.0, ans=0.1 2023-06-27 15:18:08,382 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 15:18:18,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1836756.0, ans=0.125 2023-06-27 15:18:53,540 INFO [train.py:996] (2/4) Epoch 11, batch 1200, loss[loss=0.211, simple_loss=0.2913, pruned_loss=0.06537, over 21829.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2906, pruned_loss=0.06588, over 4286030.96 frames. ], batch size: 282, lr: 2.71e-03, grad_scale: 32.0 2023-06-27 15:19:34,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1836936.0, ans=0.0 2023-06-27 15:19:36,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1836996.0, ans=0.0 2023-06-27 15:19:47,124 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.628e+02 7.428e+02 1.142e+03 1.630e+03 3.056e+03, threshold=2.284e+03, percent-clipped=6.0 2023-06-27 15:20:23,751 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.60 vs. limit=10.0 2023-06-27 15:20:31,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1837116.0, ans=0.125 2023-06-27 15:20:37,526 INFO [train.py:996] (2/4) Epoch 11, batch 1250, loss[loss=0.2346, simple_loss=0.3084, pruned_loss=0.08037, over 21199.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2936, pruned_loss=0.06706, over 4291372.35 frames. ], batch size: 143, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:20:43,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1837176.0, ans=0.125 2023-06-27 15:20:51,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1837176.0, ans=0.125 2023-06-27 15:21:09,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1837236.0, ans=0.125 2023-06-27 15:22:21,881 INFO [train.py:996] (2/4) Epoch 11, batch 1300, loss[loss=0.2063, simple_loss=0.2934, pruned_loss=0.05965, over 21421.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2953, pruned_loss=0.06788, over 4296310.60 frames. 
], batch size: 211, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:22:24,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1837476.0, ans=0.0 2023-06-27 15:22:30,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1837476.0, ans=0.125 2023-06-27 15:23:16,825 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.358e+02 6.400e+02 8.214e+02 1.269e+03 2.290e+03, threshold=1.643e+03, percent-clipped=1.0 2023-06-27 15:23:22,946 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.69 vs. limit=15.0 2023-06-27 15:24:02,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1837716.0, ans=0.125 2023-06-27 15:24:11,936 INFO [train.py:996] (2/4) Epoch 11, batch 1350, loss[loss=0.1763, simple_loss=0.2249, pruned_loss=0.06384, over 19951.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2945, pruned_loss=0.06788, over 4293290.15 frames. ], batch size: 703, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:24:44,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1837836.0, ans=0.95 2023-06-27 15:25:30,607 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.85 vs. limit=15.0 2023-06-27 15:25:53,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1838016.0, ans=0.125 2023-06-27 15:25:56,193 INFO [train.py:996] (2/4) Epoch 11, batch 1400, loss[loss=0.2454, simple_loss=0.3361, pruned_loss=0.0774, over 21477.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2935, pruned_loss=0.06829, over 4286360.09 frames. ], batch size: 471, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:25:56,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1838076.0, ans=0.125 2023-06-27 15:26:21,281 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.64 vs. limit=15.0 2023-06-27 15:26:42,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1838196.0, ans=10.0 2023-06-27 15:26:46,610 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.209e+02 7.064e+02 1.087e+03 1.603e+03 3.118e+03, threshold=2.174e+03, percent-clipped=20.0 2023-06-27 15:27:03,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1838256.0, ans=0.2 2023-06-27 15:27:10,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1838256.0, ans=0.125 2023-06-27 15:27:15,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1838256.0, ans=0.125 2023-06-27 15:27:28,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1838316.0, ans=0.0 2023-06-27 15:27:39,802 INFO [train.py:996] (2/4) Epoch 11, batch 1450, loss[loss=0.2251, simple_loss=0.3, pruned_loss=0.07514, over 21678.00 frames. 
], tot_loss[loss=0.2143, simple_loss=0.2919, pruned_loss=0.06839, over 4288979.67 frames. ], batch size: 332, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:27:50,878 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.93 vs. limit=22.5 2023-06-27 15:28:12,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1838436.0, ans=0.125 2023-06-27 15:28:36,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1838496.0, ans=0.125 2023-06-27 15:28:48,675 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-27 15:29:01,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1838616.0, ans=0.1 2023-06-27 15:29:09,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1838616.0, ans=0.125 2023-06-27 15:29:28,832 INFO [train.py:996] (2/4) Epoch 11, batch 1500, loss[loss=0.1919, simple_loss=0.269, pruned_loss=0.05743, over 21926.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2933, pruned_loss=0.06901, over 4280744.99 frames. ], batch size: 316, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:29:41,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1838676.0, ans=0.04949747468305833 2023-06-27 15:30:14,625 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.898e+02 7.080e+02 9.690e+02 1.530e+03 3.266e+03, threshold=1.938e+03, percent-clipped=8.0 2023-06-27 15:30:15,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1838796.0, ans=0.125 2023-06-27 15:30:20,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=1838796.0, ans=0.2 2023-06-27 15:30:36,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1838856.0, ans=0.2 2023-06-27 15:30:52,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1838916.0, ans=0.125 2023-06-27 15:31:14,247 INFO [train.py:996] (2/4) Epoch 11, batch 1550, loss[loss=0.1891, simple_loss=0.2534, pruned_loss=0.06239, over 21550.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2904, pruned_loss=0.06766, over 4273516.75 frames. ], batch size: 441, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:31:31,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1839036.0, ans=0.125 2023-06-27 15:32:28,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1839156.0, ans=0.125 2023-06-27 15:32:33,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1839156.0, ans=0.0 2023-06-27 15:32:37,523 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.21 vs. 
limit=10.0 2023-06-27 15:32:54,668 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0 2023-06-27 15:33:01,773 INFO [train.py:996] (2/4) Epoch 11, batch 1600, loss[loss=0.1781, simple_loss=0.2436, pruned_loss=0.05635, over 21198.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2893, pruned_loss=0.06663, over 4280277.20 frames. ], batch size: 159, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:33:03,004 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0 2023-06-27 15:33:28,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1839336.0, ans=0.125 2023-06-27 15:33:35,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1839336.0, ans=0.0 2023-06-27 15:33:40,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1839396.0, ans=0.2 2023-06-27 15:33:53,892 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.998e+02 6.555e+02 8.833e+02 1.502e+03 3.809e+03, threshold=1.767e+03, percent-clipped=10.0 2023-06-27 15:34:17,968 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-27 15:34:37,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1839516.0, ans=0.2 2023-06-27 15:34:48,937 INFO [train.py:996] (2/4) Epoch 11, batch 1650, loss[loss=0.2231, simple_loss=0.3118, pruned_loss=0.06722, over 20934.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2899, pruned_loss=0.06646, over 4274859.71 frames. ], batch size: 607, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:35:02,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1839576.0, ans=0.1 2023-06-27 15:35:52,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1839696.0, ans=0.95 2023-06-27 15:36:20,635 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.19 vs. limit=22.5 2023-06-27 15:36:37,006 INFO [train.py:996] (2/4) Epoch 11, batch 1700, loss[loss=0.2366, simple_loss=0.3276, pruned_loss=0.07284, over 21610.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2933, pruned_loss=0.06716, over 4276743.53 frames. 
], batch size: 441, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:37:35,056 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.435e+02 5.947e+02 9.216e+02 1.351e+03 2.792e+03, threshold=1.843e+03, percent-clipped=11.0 2023-06-27 15:37:52,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1840056.0, ans=0.0 2023-06-27 15:38:20,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1840116.0, ans=0.1 2023-06-27 15:38:29,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1840176.0, ans=0.125 2023-06-27 15:38:30,373 INFO [train.py:996] (2/4) Epoch 11, batch 1750, loss[loss=0.1525, simple_loss=0.2207, pruned_loss=0.04217, over 21780.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2926, pruned_loss=0.06637, over 4270506.37 frames. ], batch size: 124, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:38:51,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1840236.0, ans=0.125 2023-06-27 15:39:04,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1840236.0, ans=0.0 2023-06-27 15:40:08,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1840416.0, ans=0.125 2023-06-27 15:40:22,768 INFO [train.py:996] (2/4) Epoch 11, batch 1800, loss[loss=0.2237, simple_loss=0.3041, pruned_loss=0.07168, over 21344.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2901, pruned_loss=0.06412, over 4261857.72 frames. ], batch size: 549, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:40:25,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1840476.0, ans=0.125 2023-06-27 15:40:25,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1840476.0, ans=0.125 2023-06-27 15:40:25,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1840476.0, ans=0.125 2023-06-27 15:40:30,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1840476.0, ans=0.0 2023-06-27 15:41:13,919 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.102e+02 6.830e+02 1.090e+03 1.802e+03 4.605e+03, threshold=2.180e+03, percent-clipped=19.0 2023-06-27 15:41:40,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1840656.0, ans=0.1 2023-06-27 15:42:09,052 INFO [train.py:996] (2/4) Epoch 11, batch 1850, loss[loss=0.1994, simple_loss=0.2808, pruned_loss=0.05897, over 21411.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2888, pruned_loss=0.06209, over 4262089.81 frames. ], batch size: 131, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:42:13,610 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.41 vs. 
limit=15.0 2023-06-27 15:42:53,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1840896.0, ans=0.1 2023-06-27 15:43:21,209 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.96 vs. limit=15.0 2023-06-27 15:43:30,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1841016.0, ans=0.04949747468305833 2023-06-27 15:43:44,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1841016.0, ans=0.0 2023-06-27 15:43:51,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1841016.0, ans=0.1 2023-06-27 15:43:53,668 INFO [train.py:996] (2/4) Epoch 11, batch 1900, loss[loss=0.18, simple_loss=0.2674, pruned_loss=0.04634, over 21814.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2908, pruned_loss=0.06263, over 4264555.32 frames. ], batch size: 282, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:44:42,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1841196.0, ans=0.125 2023-06-27 15:44:43,247 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.131e+02 8.434e+02 1.477e+03 2.094e+03 4.159e+03, threshold=2.954e+03, percent-clipped=22.0 2023-06-27 15:45:39,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1841316.0, ans=0.125 2023-06-27 15:45:41,635 INFO [train.py:996] (2/4) Epoch 11, batch 1950, loss[loss=0.1924, simple_loss=0.2536, pruned_loss=0.06566, over 21907.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2883, pruned_loss=0.06205, over 4261964.55 frames. ], batch size: 125, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:46:25,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1841496.0, ans=0.0 2023-06-27 15:46:29,487 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=20.12 vs. limit=15.0 2023-06-27 15:46:56,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1841556.0, ans=0.125 2023-06-27 15:47:26,630 INFO [train.py:996] (2/4) Epoch 11, batch 2000, loss[loss=0.2322, simple_loss=0.3098, pruned_loss=0.07733, over 19923.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.285, pruned_loss=0.06168, over 4267422.56 frames. ], batch size: 703, lr: 2.71e-03, grad_scale: 32.0 2023-06-27 15:47:41,251 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.03 vs. limit=15.0 2023-06-27 15:48:02,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1841736.0, ans=0.0 2023-06-27 15:48:13,537 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.255e+02 7.614e+02 1.079e+03 2.039e+03 3.848e+03, threshold=2.158e+03, percent-clipped=8.0 2023-06-27 15:48:31,223 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.14 vs. 
limit=15.0 2023-06-27 15:48:40,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1841856.0, ans=0.125 2023-06-27 15:49:09,578 INFO [train.py:996] (2/4) Epoch 11, batch 2050, loss[loss=0.1969, simple_loss=0.2848, pruned_loss=0.05455, over 21462.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2856, pruned_loss=0.06282, over 4275944.93 frames. ], batch size: 131, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:49:43,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1842036.0, ans=0.125 2023-06-27 15:49:57,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1842096.0, ans=0.1 2023-06-27 15:50:10,478 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.44 vs. limit=6.0 2023-06-27 15:50:34,214 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-27 15:50:59,239 INFO [train.py:996] (2/4) Epoch 11, batch 2100, loss[loss=0.2584, simple_loss=0.3388, pruned_loss=0.08898, over 21777.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2885, pruned_loss=0.06453, over 4282515.69 frames. ], batch size: 414, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:51:03,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1842276.0, ans=0.125 2023-06-27 15:51:33,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1842396.0, ans=0.5 2023-06-27 15:51:46,354 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.397e+02 7.542e+02 1.130e+03 1.676e+03 4.140e+03, threshold=2.259e+03, percent-clipped=14.0 2023-06-27 15:51:49,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1842396.0, ans=0.125 2023-06-27 15:52:44,214 INFO [train.py:996] (2/4) Epoch 11, batch 2150, loss[loss=0.1969, simple_loss=0.2625, pruned_loss=0.06565, over 21743.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2925, pruned_loss=0.06745, over 4279433.81 frames. ], batch size: 124, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:53:05,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1842636.0, ans=0.1 2023-06-27 15:53:19,918 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-27 15:53:29,929 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.56 vs. limit=12.0 2023-06-27 15:53:35,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1842696.0, ans=0.015 2023-06-27 15:54:29,178 INFO [train.py:996] (2/4) Epoch 11, batch 2200, loss[loss=0.2049, simple_loss=0.286, pruned_loss=0.06195, over 21817.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2951, pruned_loss=0.06803, over 4280719.60 frames. 
], batch size: 124, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:54:53,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1842936.0, ans=0.0 2023-06-27 15:55:16,679 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.261e+02 6.339e+02 9.896e+02 1.686e+03 3.946e+03, threshold=1.979e+03, percent-clipped=15.0 2023-06-27 15:55:58,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1843116.0, ans=0.0 2023-06-27 15:56:14,356 INFO [train.py:996] (2/4) Epoch 11, batch 2250, loss[loss=0.2225, simple_loss=0.3477, pruned_loss=0.04867, over 20825.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2962, pruned_loss=0.06731, over 4277760.47 frames. ], batch size: 608, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:56:24,094 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=12.0 2023-06-27 15:56:59,073 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=22.5 2023-06-27 15:57:08,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1843356.0, ans=0.1 2023-06-27 15:57:31,068 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.44 vs. limit=22.5 2023-06-27 15:57:52,268 INFO [train.py:996] (2/4) Epoch 11, batch 2300, loss[loss=0.2533, simple_loss=0.3345, pruned_loss=0.08598, over 21570.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.293, pruned_loss=0.0666, over 4281438.80 frames. ], batch size: 471, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:58:39,381 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.760e+02 6.436e+02 1.038e+03 1.737e+03 5.031e+03, threshold=2.076e+03, percent-clipped=15.0 2023-06-27 15:58:39,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1843596.0, ans=0.0 2023-06-27 15:58:43,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1843596.0, ans=0.2 2023-06-27 15:58:46,750 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 15:59:36,635 INFO [train.py:996] (2/4) Epoch 11, batch 2350, loss[loss=0.2138, simple_loss=0.2858, pruned_loss=0.07091, over 21647.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2861, pruned_loss=0.06565, over 4272088.99 frames. ], batch size: 332, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:59:48,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1843776.0, ans=0.125 2023-06-27 16:00:29,706 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.67 vs. limit=15.0 2023-06-27 16:00:39,767 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=15.0 2023-06-27 16:01:21,955 INFO [train.py:996] (2/4) Epoch 11, batch 2400, loss[loss=0.1992, simple_loss=0.2755, pruned_loss=0.06146, over 21083.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2874, pruned_loss=0.06673, over 4263787.66 frames. 
], batch size: 607, lr: 2.71e-03, grad_scale: 32.0 2023-06-27 16:01:27,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1844076.0, ans=0.125 2023-06-27 16:01:54,399 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.57 vs. limit=15.0 2023-06-27 16:02:21,960 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.317e+02 6.915e+02 1.084e+03 1.714e+03 3.712e+03, threshold=2.167e+03, percent-clipped=11.0 2023-06-27 16:02:24,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1844196.0, ans=0.0 2023-06-27 16:03:07,403 INFO [train.py:996] (2/4) Epoch 11, batch 2450, loss[loss=0.2056, simple_loss=0.2807, pruned_loss=0.06529, over 21822.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2914, pruned_loss=0.06956, over 4271509.42 frames. ], batch size: 317, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:03:56,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=1844496.0, ans=0.2 2023-06-27 16:04:14,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1844556.0, ans=0.125 2023-06-27 16:04:24,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1844556.0, ans=0.0 2023-06-27 16:04:49,999 INFO [train.py:996] (2/4) Epoch 11, batch 2500, loss[loss=0.2112, simple_loss=0.2897, pruned_loss=0.06634, over 21448.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2879, pruned_loss=0.0692, over 4266956.96 frames. ], batch size: 211, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:05:08,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1844736.0, ans=0.125 2023-06-27 16:05:22,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1844736.0, ans=0.1 2023-06-27 16:05:43,704 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.461e+02 7.979e+02 1.093e+03 1.704e+03 3.202e+03, threshold=2.185e+03, percent-clipped=12.0 2023-06-27 16:06:23,382 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0 2023-06-27 16:06:34,028 INFO [train.py:996] (2/4) Epoch 11, batch 2550, loss[loss=0.2308, simple_loss=0.3055, pruned_loss=0.07807, over 21642.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2879, pruned_loss=0.06898, over 4273385.65 frames. ], batch size: 263, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:06:44,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=1844976.0, ans=0.1 2023-06-27 16:06:47,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1844976.0, ans=0.125 2023-06-27 16:07:06,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1845036.0, ans=0.0 2023-06-27 16:07:07,271 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.22 vs. 
limit=15.0 2023-06-27 16:07:08,771 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.71 vs. limit=15.0 2023-06-27 16:08:18,040 INFO [train.py:996] (2/4) Epoch 11, batch 2600, loss[loss=0.2027, simple_loss=0.2766, pruned_loss=0.06438, over 21754.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2889, pruned_loss=0.06888, over 4266929.34 frames. ], batch size: 247, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:09:02,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1845396.0, ans=0.125 2023-06-27 16:09:12,269 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.122e+02 7.338e+02 1.284e+03 1.915e+03 4.312e+03, threshold=2.567e+03, percent-clipped=18.0 2023-06-27 16:09:45,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1845516.0, ans=0.125 2023-06-27 16:09:58,132 INFO [train.py:996] (2/4) Epoch 11, batch 2650, loss[loss=0.1761, simple_loss=0.2538, pruned_loss=0.04919, over 21619.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2904, pruned_loss=0.06962, over 4271883.47 frames. ], batch size: 263, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:10:14,473 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 16:10:57,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1845696.0, ans=10.0 2023-06-27 16:11:43,795 INFO [train.py:996] (2/4) Epoch 11, batch 2700, loss[loss=0.1996, simple_loss=0.27, pruned_loss=0.06456, over 21659.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2889, pruned_loss=0.06876, over 4266720.76 frames. ], batch size: 247, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:11:49,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1845876.0, ans=0.0 2023-06-27 16:11:49,919 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=12.0 2023-06-27 16:12:37,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1845996.0, ans=0.0 2023-06-27 16:12:43,568 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.729e+02 6.625e+02 9.246e+02 1.409e+03 2.648e+03, threshold=1.849e+03, percent-clipped=2.0 2023-06-27 16:13:25,027 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.26 vs. limit=15.0 2023-06-27 16:13:28,877 INFO [train.py:996] (2/4) Epoch 11, batch 2750, loss[loss=0.2056, simple_loss=0.2957, pruned_loss=0.05776, over 21624.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2885, pruned_loss=0.06868, over 4266153.54 frames. 
], batch size: 263, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:14:13,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1846236.0, ans=0.125 2023-06-27 16:14:17,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=1846296.0, ans=0.02 2023-06-27 16:14:26,971 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.72 vs. limit=5.0 2023-06-27 16:14:36,460 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 16:14:45,604 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0 2023-06-27 16:14:50,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1846356.0, ans=0.0 2023-06-27 16:15:15,719 INFO [train.py:996] (2/4) Epoch 11, batch 2800, loss[loss=0.2808, simple_loss=0.3672, pruned_loss=0.09718, over 21641.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2914, pruned_loss=0.06986, over 4272676.15 frames. ], batch size: 414, lr: 2.71e-03, grad_scale: 32.0 2023-06-27 16:15:49,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1846536.0, ans=0.125 2023-06-27 16:15:51,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1846536.0, ans=0.125 2023-06-27 16:16:18,333 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.904e+02 7.981e+02 1.210e+03 1.745e+03 3.756e+03, threshold=2.419e+03, percent-clipped=24.0 2023-06-27 16:16:27,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1846656.0, ans=0.125 2023-06-27 16:17:03,354 INFO [train.py:996] (2/4) Epoch 11, batch 2850, loss[loss=0.2201, simple_loss=0.3498, pruned_loss=0.04524, over 19718.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2923, pruned_loss=0.0701, over 4264560.43 frames. ], batch size: 703, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:17:48,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1846836.0, ans=0.125 2023-06-27 16:17:59,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1846896.0, ans=0.125 2023-06-27 16:18:03,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1846896.0, ans=0.125 2023-06-27 16:18:05,768 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.19 vs. limit=6.0 2023-06-27 16:18:10,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1846956.0, ans=0.125 2023-06-27 16:18:15,688 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.93 vs. 
limit=10.0 2023-06-27 16:18:17,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1846956.0, ans=0.0 2023-06-27 16:18:20,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1846956.0, ans=0.0 2023-06-27 16:18:27,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1847016.0, ans=0.125 2023-06-27 16:18:41,457 INFO [train.py:996] (2/4) Epoch 11, batch 2900, loss[loss=0.2243, simple_loss=0.2976, pruned_loss=0.07552, over 21931.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2914, pruned_loss=0.06989, over 4275851.74 frames. ], batch size: 107, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:18:59,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1847076.0, ans=0.1 2023-06-27 16:18:59,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1847076.0, ans=0.125 2023-06-27 16:19:15,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1847136.0, ans=0.0 2023-06-27 16:19:31,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1847196.0, ans=0.0 2023-06-27 16:19:45,514 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.568e+02 6.840e+02 9.553e+02 1.645e+03 3.808e+03, threshold=1.911e+03, percent-clipped=8.0 2023-06-27 16:20:25,234 INFO [train.py:996] (2/4) Epoch 11, batch 2950, loss[loss=0.2385, simple_loss=0.3323, pruned_loss=0.07237, over 21816.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2934, pruned_loss=0.06963, over 4277306.42 frames. ], batch size: 351, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:20:49,843 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.15 vs. limit=15.0 2023-06-27 16:21:35,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1847556.0, ans=0.0 2023-06-27 16:21:56,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1847616.0, ans=0.1 2023-06-27 16:22:14,900 INFO [train.py:996] (2/4) Epoch 11, batch 3000, loss[loss=0.2341, simple_loss=0.3169, pruned_loss=0.0756, over 21762.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2979, pruned_loss=0.07002, over 4282399.23 frames. ], batch size: 332, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:22:14,901 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-27 16:22:32,732 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.4202, 3.9880, 3.5867, 2.4747], device='cuda:2') 2023-06-27 16:22:35,503 INFO [train.py:1028] (2/4) Epoch 11, validation: loss=0.2528, simple_loss=0.3433, pruned_loss=0.08109, over 1796401.00 frames. 
2023-06-27 16:22:35,503 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-27 16:22:46,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1847676.0, ans=0.125 2023-06-27 16:23:00,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1847736.0, ans=0.0 2023-06-27 16:23:02,140 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 16:23:09,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1847736.0, ans=0.125 2023-06-27 16:23:27,451 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.016e+02 6.559e+02 9.881e+02 1.581e+03 3.511e+03, threshold=1.976e+03, percent-clipped=15.0 2023-06-27 16:23:39,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1847856.0, ans=0.125 2023-06-27 16:23:41,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1847856.0, ans=0.07 2023-06-27 16:24:16,780 INFO [train.py:996] (2/4) Epoch 11, batch 3050, loss[loss=0.1758, simple_loss=0.258, pruned_loss=0.04676, over 21824.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2986, pruned_loss=0.06891, over 4272507.49 frames. ], batch size: 282, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:25:29,763 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.83 vs. limit=22.5 2023-06-27 16:25:58,036 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.67 vs. limit=15.0 2023-06-27 16:26:03,803 INFO [train.py:996] (2/4) Epoch 11, batch 3100, loss[loss=0.2122, simple_loss=0.2888, pruned_loss=0.06779, over 20843.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2973, pruned_loss=0.06795, over 4264748.22 frames. ], batch size: 607, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:26:20,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1848276.0, ans=0.125 2023-06-27 16:26:54,556 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.218e+02 9.868e+02 1.604e+03 2.316e+03 3.970e+03, threshold=3.207e+03, percent-clipped=39.0 2023-06-27 16:27:03,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1848456.0, ans=0.2 2023-06-27 16:27:54,287 INFO [train.py:996] (2/4) Epoch 11, batch 3150, loss[loss=0.2511, simple_loss=0.3212, pruned_loss=0.09047, over 21266.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2963, pruned_loss=0.06682, over 4265393.11 frames. ], batch size: 143, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:27:58,852 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-27 16:28:12,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1848636.0, ans=0.04949747468305833 2023-06-27 16:28:24,673 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.53 vs. 
limit=15.0 2023-06-27 16:29:40,792 INFO [train.py:996] (2/4) Epoch 11, batch 3200, loss[loss=0.2366, simple_loss=0.3174, pruned_loss=0.07788, over 21354.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2984, pruned_loss=0.06715, over 4267440.15 frames. ], batch size: 211, lr: 2.70e-03, grad_scale: 32.0 2023-06-27 16:30:13,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1848936.0, ans=0.0 2023-06-27 16:30:36,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1848996.0, ans=0.0 2023-06-27 16:30:40,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1848996.0, ans=0.07 2023-06-27 16:30:42,998 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.812e+02 8.312e+02 1.188e+03 1.817e+03 3.495e+03, threshold=2.376e+03, percent-clipped=3.0 2023-06-27 16:31:25,352 INFO [train.py:996] (2/4) Epoch 11, batch 3250, loss[loss=0.1986, simple_loss=0.251, pruned_loss=0.0731, over 20065.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2987, pruned_loss=0.06807, over 4276331.38 frames. ], batch size: 702, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:32:47,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1849356.0, ans=0.07 2023-06-27 16:33:03,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1849416.0, ans=0.125 2023-06-27 16:33:11,139 INFO [train.py:996] (2/4) Epoch 11, batch 3300, loss[loss=0.254, simple_loss=0.3244, pruned_loss=0.09181, over 21339.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2956, pruned_loss=0.06866, over 4274703.63 frames. ], batch size: 471, lr: 2.70e-03, grad_scale: 8.0 2023-06-27 16:34:04,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1849596.0, ans=0.125 2023-06-27 16:34:15,386 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.316e+02 6.747e+02 1.095e+03 2.044e+03 4.676e+03, threshold=2.190e+03, percent-clipped=15.0 2023-06-27 16:34:17,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1849656.0, ans=0.125 2023-06-27 16:34:17,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1849656.0, ans=0.125 2023-06-27 16:34:50,735 INFO [train.py:996] (2/4) Epoch 11, batch 3350, loss[loss=0.2184, simple_loss=0.2907, pruned_loss=0.07301, over 21374.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2971, pruned_loss=0.06856, over 4273505.59 frames. ], batch size: 211, lr: 2.70e-03, grad_scale: 8.0 2023-06-27 16:34:51,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1849776.0, ans=0.1 2023-06-27 16:35:05,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1849776.0, ans=0.0 2023-06-27 16:35:43,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1849896.0, ans=0.1 2023-06-27 16:36:35,682 INFO [train.py:996] (2/4) Epoch 11, batch 3400, loss[loss=0.195, simple_loss=0.2752, pruned_loss=0.05744, over 21836.00 frames. 
], tot_loss[loss=0.2172, simple_loss=0.2972, pruned_loss=0.06866, over 4279679.32 frames. ], batch size: 107, lr: 2.70e-03, grad_scale: 8.0 2023-06-27 16:37:43,326 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.448e+02 6.438e+02 9.614e+02 1.434e+03 2.571e+03, threshold=1.923e+03, percent-clipped=1.0 2023-06-27 16:38:24,832 INFO [train.py:996] (2/4) Epoch 11, batch 3450, loss[loss=0.194, simple_loss=0.2602, pruned_loss=0.06392, over 21808.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.293, pruned_loss=0.06793, over 4279488.14 frames. ], batch size: 352, lr: 2.70e-03, grad_scale: 8.0 2023-06-27 16:39:04,649 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 16:39:35,792 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 16:40:15,589 INFO [train.py:996] (2/4) Epoch 11, batch 3500, loss[loss=0.2962, simple_loss=0.3619, pruned_loss=0.1153, over 21476.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3006, pruned_loss=0.07144, over 4283278.62 frames. ], batch size: 471, lr: 2.70e-03, grad_scale: 8.0 2023-06-27 16:40:34,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1850676.0, ans=0.125 2023-06-27 16:41:14,140 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.683e+02 8.214e+02 1.340e+03 2.218e+03 5.014e+03, threshold=2.681e+03, percent-clipped=29.0 2023-06-27 16:41:24,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1850856.0, ans=0.0 2023-06-27 16:41:28,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1850856.0, ans=0.125 2023-06-27 16:41:42,705 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.11 vs. limit=22.5 2023-06-27 16:42:05,014 INFO [train.py:996] (2/4) Epoch 11, batch 3550, loss[loss=0.2082, simple_loss=0.2797, pruned_loss=0.06834, over 21829.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3034, pruned_loss=0.07242, over 4272701.86 frames. ], batch size: 98, lr: 2.70e-03, grad_scale: 8.0 2023-06-27 16:43:49,656 INFO [train.py:996] (2/4) Epoch 11, batch 3600, loss[loss=0.2127, simple_loss=0.2774, pruned_loss=0.07396, over 21719.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2972, pruned_loss=0.0714, over 4274340.11 frames. ], batch size: 112, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:43:59,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1851276.0, ans=0.2 2023-06-27 16:43:59,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1851276.0, ans=0.125 2023-06-27 16:44:37,554 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.12 vs. 
limit=15.0 2023-06-27 16:44:44,959 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.210e+02 6.431e+02 1.048e+03 1.688e+03 3.904e+03, threshold=2.095e+03, percent-clipped=4.0 2023-06-27 16:44:50,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1851456.0, ans=0.125 2023-06-27 16:45:36,121 INFO [train.py:996] (2/4) Epoch 11, batch 3650, loss[loss=0.1836, simple_loss=0.2769, pruned_loss=0.04513, over 19979.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2972, pruned_loss=0.07148, over 4272434.54 frames. ], batch size: 703, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:45:53,440 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.99 vs. limit=15.0 2023-06-27 16:46:01,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1851636.0, ans=0.1 2023-06-27 16:46:05,060 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0 2023-06-27 16:46:06,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1851636.0, ans=0.025 2023-06-27 16:47:19,949 INFO [train.py:996] (2/4) Epoch 11, batch 3700, loss[loss=0.2329, simple_loss=0.3125, pruned_loss=0.07672, over 21763.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2954, pruned_loss=0.07065, over 4274397.24 frames. ], batch size: 441, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:47:32,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1851876.0, ans=0.0 2023-06-27 16:47:47,665 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-27 16:48:11,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=1851996.0, ans=0.2 2023-06-27 16:48:13,914 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.881e+02 6.744e+02 1.016e+03 1.702e+03 3.129e+03, threshold=2.032e+03, percent-clipped=14.0 2023-06-27 16:48:16,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1852056.0, ans=0.035 2023-06-27 16:49:04,970 INFO [train.py:996] (2/4) Epoch 11, batch 3750, loss[loss=0.2638, simple_loss=0.3425, pruned_loss=0.0926, over 21426.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2963, pruned_loss=0.07142, over 4280901.62 frames. 
], batch size: 549, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:49:17,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1852176.0, ans=0.125 2023-06-27 16:49:58,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1852296.0, ans=0.0 2023-06-27 16:50:08,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1852356.0, ans=0.1 2023-06-27 16:50:18,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1852356.0, ans=0.1 2023-06-27 16:50:25,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1852356.0, ans=0.0 2023-06-27 16:50:28,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1852416.0, ans=0.2 2023-06-27 16:50:35,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1852416.0, ans=0.125 2023-06-27 16:50:49,322 INFO [train.py:996] (2/4) Epoch 11, batch 3800, loss[loss=0.2088, simple_loss=0.302, pruned_loss=0.05776, over 19865.00 frames. ], tot_loss[loss=0.216, simple_loss=0.293, pruned_loss=0.06952, over 4279013.36 frames. ], batch size: 702, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:50:53,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1852476.0, ans=0.125 2023-06-27 16:51:05,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1852536.0, ans=0.125 2023-06-27 16:51:18,379 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 16:51:23,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1852596.0, ans=0.125 2023-06-27 16:51:47,705 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.392e+02 7.087e+02 9.540e+02 1.301e+03 2.936e+03, threshold=1.908e+03, percent-clipped=6.0 2023-06-27 16:52:32,395 INFO [train.py:996] (2/4) Epoch 11, batch 3850, loss[loss=0.2077, simple_loss=0.2793, pruned_loss=0.06809, over 20175.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2904, pruned_loss=0.06967, over 4277372.34 frames. 
], batch size: 703, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:52:56,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1852836.0, ans=0.2 2023-06-27 16:53:01,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1852836.0, ans=0.125 2023-06-27 16:53:01,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1852836.0, ans=0.125 2023-06-27 16:53:26,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1852956.0, ans=0.0 2023-06-27 16:53:42,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1852956.0, ans=0.1 2023-06-27 16:53:50,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1852956.0, ans=0.09899494936611666 2023-06-27 16:54:14,685 INFO [train.py:996] (2/4) Epoch 11, batch 3900, loss[loss=0.2036, simple_loss=0.2711, pruned_loss=0.06799, over 21699.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2862, pruned_loss=0.06964, over 4288172.87 frames. ], batch size: 391, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:54:36,494 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.87 vs. limit=10.0 2023-06-27 16:54:40,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1853136.0, ans=0.1 2023-06-27 16:54:48,277 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=22.5 2023-06-27 16:55:09,164 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.347e+02 6.134e+02 8.883e+02 1.369e+03 3.236e+03, threshold=1.777e+03, percent-clipped=7.0 2023-06-27 16:55:20,478 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.13 vs. limit=10.0 2023-06-27 16:55:54,581 INFO [train.py:996] (2/4) Epoch 11, batch 3950, loss[loss=0.1865, simple_loss=0.257, pruned_loss=0.05805, over 21840.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2883, pruned_loss=0.06874, over 4286341.97 frames. ], batch size: 124, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:56:03,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1853376.0, ans=0.2 2023-06-27 16:56:29,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1853496.0, ans=22.5 2023-06-27 16:56:50,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1853496.0, ans=0.1 2023-06-27 16:57:32,927 INFO [train.py:996] (2/4) Epoch 11, batch 4000, loss[loss=0.1887, simple_loss=0.2592, pruned_loss=0.05907, over 22009.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2839, pruned_loss=0.06594, over 4282266.86 frames. 
], batch size: 103, lr: 2.70e-03, grad_scale: 32.0 2023-06-27 16:57:56,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1853736.0, ans=0.0 2023-06-27 16:57:58,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1853736.0, ans=0.125 2023-06-27 16:58:10,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1853796.0, ans=0.1 2023-06-27 16:58:37,197 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.700e+02 6.762e+02 1.217e+03 2.027e+03 5.671e+03, threshold=2.434e+03, percent-clipped=30.0 2023-06-27 16:58:59,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1853916.0, ans=0.125 2023-06-27 16:59:06,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1853916.0, ans=0.0 2023-06-27 16:59:17,771 INFO [train.py:996] (2/4) Epoch 11, batch 4050, loss[loss=0.2273, simple_loss=0.3227, pruned_loss=0.06593, over 19765.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2825, pruned_loss=0.06424, over 4271575.49 frames. ], batch size: 703, lr: 2.70e-03, grad_scale: 32.0 2023-06-27 16:59:36,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1854036.0, ans=0.0 2023-06-27 16:59:41,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1854036.0, ans=0.125 2023-06-27 16:59:41,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1854036.0, ans=0.2 2023-06-27 16:59:41,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1854036.0, ans=0.125 2023-06-27 16:59:53,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1854036.0, ans=0.0 2023-06-27 17:00:08,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1854096.0, ans=0.125 2023-06-27 17:00:56,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1854216.0, ans=0.0 2023-06-27 17:01:01,311 INFO [train.py:996] (2/4) Epoch 11, batch 4100, loss[loss=0.1848, simple_loss=0.2742, pruned_loss=0.04771, over 21517.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2847, pruned_loss=0.06579, over 4272041.49 frames. ], batch size: 195, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:01:12,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1854276.0, ans=0.0 2023-06-27 17:01:56,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1854396.0, ans=0.125 2023-06-27 17:02:11,182 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.335e+02 7.301e+02 1.093e+03 1.524e+03 3.311e+03, threshold=2.186e+03, percent-clipped=4.0 2023-06-27 17:02:41,237 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.37 vs. 
limit=15.0 2023-06-27 17:02:45,151 INFO [train.py:996] (2/4) Epoch 11, batch 4150, loss[loss=0.1784, simple_loss=0.2631, pruned_loss=0.0469, over 21452.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.286, pruned_loss=0.06415, over 4277058.50 frames. ], batch size: 195, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:03:11,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1854636.0, ans=0.0 2023-06-27 17:03:49,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1854696.0, ans=0.125 2023-06-27 17:03:56,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1854756.0, ans=0.0 2023-06-27 17:04:14,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1854816.0, ans=0.0 2023-06-27 17:04:14,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1854816.0, ans=0.2 2023-06-27 17:04:27,387 INFO [train.py:996] (2/4) Epoch 11, batch 4200, loss[loss=0.2075, simple_loss=0.2816, pruned_loss=0.06673, over 21696.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2866, pruned_loss=0.06356, over 4275200.96 frames. ], batch size: 316, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:04:47,873 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=15.0 2023-06-27 17:04:57,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1854936.0, ans=0.125 2023-06-27 17:05:33,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1855056.0, ans=0.025 2023-06-27 17:05:34,650 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.415e+02 6.010e+02 8.416e+02 1.376e+03 4.083e+03, threshold=1.683e+03, percent-clipped=10.0 2023-06-27 17:05:52,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1855116.0, ans=0.1 2023-06-27 17:06:14,239 INFO [train.py:996] (2/4) Epoch 11, batch 4250, loss[loss=0.2306, simple_loss=0.3125, pruned_loss=0.07431, over 21774.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2914, pruned_loss=0.06436, over 4272631.47 frames. ], batch size: 332, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:06:33,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1855176.0, ans=0.0 2023-06-27 17:07:30,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1855356.0, ans=0.125 2023-06-27 17:07:58,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1855416.0, ans=0.1 2023-06-27 17:08:00,649 INFO [train.py:996] (2/4) Epoch 11, batch 4300, loss[loss=0.2167, simple_loss=0.2996, pruned_loss=0.06687, over 21407.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2974, pruned_loss=0.06624, over 4272447.78 frames. 
], batch size: 131, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:08:01,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1855476.0, ans=0.125 2023-06-27 17:08:19,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1855476.0, ans=0.125 2023-06-27 17:08:55,703 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.494e+02 7.202e+02 1.029e+03 1.570e+03 4.728e+03, threshold=2.058e+03, percent-clipped=18.0 2023-06-27 17:09:39,123 INFO [train.py:996] (2/4) Epoch 11, batch 4350, loss[loss=0.1798, simple_loss=0.2476, pruned_loss=0.05596, over 21445.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2961, pruned_loss=0.06556, over 4272865.64 frames. ], batch size: 211, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:10:05,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1855836.0, ans=0.0 2023-06-27 17:10:13,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1855836.0, ans=0.125 2023-06-27 17:10:39,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1855956.0, ans=0.125 2023-06-27 17:10:46,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1855956.0, ans=0.025 2023-06-27 17:11:01,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1855956.0, ans=0.0 2023-06-27 17:11:11,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1856016.0, ans=0.1 2023-06-27 17:11:29,255 INFO [train.py:996] (2/4) Epoch 11, batch 4400, loss[loss=0.2316, simple_loss=0.3142, pruned_loss=0.07447, over 21904.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2927, pruned_loss=0.06546, over 4261607.79 frames. ], batch size: 373, lr: 2.70e-03, grad_scale: 32.0 2023-06-27 17:11:29,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1856076.0, ans=0.0 2023-06-27 17:12:32,724 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.456e+02 7.940e+02 1.162e+03 1.682e+03 5.044e+03, threshold=2.325e+03, percent-clipped=15.0 2023-06-27 17:13:00,765 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-06-27 17:13:05,143 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 17:13:14,922 INFO [train.py:996] (2/4) Epoch 11, batch 4450, loss[loss=0.2375, simple_loss=0.3313, pruned_loss=0.07182, over 21275.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.3002, pruned_loss=0.06696, over 4267036.10 frames. ], batch size: 548, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:13:18,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1856376.0, ans=0.125 2023-06-27 17:14:27,567 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.55 vs. 
limit=10.0 2023-06-27 17:14:50,670 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=15.0 2023-06-27 17:14:53,986 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.65 vs. limit=22.5 2023-06-27 17:14:59,764 INFO [train.py:996] (2/4) Epoch 11, batch 4500, loss[loss=0.2205, simple_loss=0.3221, pruned_loss=0.05944, over 20744.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.3023, pruned_loss=0.06837, over 4265439.28 frames. ], batch size: 608, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:15:28,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1856736.0, ans=0.125 2023-06-27 17:15:29,448 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.97 vs. limit=6.0 2023-06-27 17:16:01,161 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.839e+02 8.296e+02 1.426e+03 1.842e+03 5.527e+03, threshold=2.851e+03, percent-clipped=18.0 2023-06-27 17:16:01,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1856856.0, ans=0.125 2023-06-27 17:16:32,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1856916.0, ans=0.0 2023-06-27 17:16:38,284 INFO [train.py:996] (2/4) Epoch 11, batch 4550, loss[loss=0.2635, simple_loss=0.3401, pruned_loss=0.0934, over 21587.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.305, pruned_loss=0.06919, over 4270641.81 frames. ], batch size: 414, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:16:40,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=1856976.0, ans=0.02 2023-06-27 17:17:15,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1857036.0, ans=0.0 2023-06-27 17:18:21,929 INFO [train.py:996] (2/4) Epoch 11, batch 4600, loss[loss=0.1907, simple_loss=0.2726, pruned_loss=0.0544, over 21470.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3071, pruned_loss=0.07057, over 4277194.03 frames. ], batch size: 211, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:18:28,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1857276.0, ans=0.125 2023-06-27 17:19:33,357 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.579e+02 7.640e+02 1.105e+03 1.523e+03 3.294e+03, threshold=2.209e+03, percent-clipped=1.0 2023-06-27 17:20:05,602 INFO [train.py:996] (2/4) Epoch 11, batch 4650, loss[loss=0.166, simple_loss=0.2453, pruned_loss=0.04333, over 21538.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.3014, pruned_loss=0.06909, over 4284224.20 frames. 
], batch size: 131, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:20:22,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1857576.0, ans=0.125 2023-06-27 17:21:04,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1857696.0, ans=0.2 2023-06-27 17:21:49,625 INFO [train.py:996] (2/4) Epoch 11, batch 4700, loss[loss=0.2078, simple_loss=0.2712, pruned_loss=0.07218, over 21656.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2934, pruned_loss=0.06733, over 4287726.19 frames. ], batch size: 393, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:22:11,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1857936.0, ans=0.125 2023-06-27 17:22:11,999 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.58 vs. limit=15.0 2023-06-27 17:22:41,893 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=15.0 2023-06-27 17:22:47,034 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=15.0 2023-06-27 17:22:59,953 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.795e+02 6.918e+02 1.097e+03 1.707e+03 4.002e+03, threshold=2.193e+03, percent-clipped=11.0 2023-06-27 17:23:00,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1858056.0, ans=0.0 2023-06-27 17:23:25,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1858116.0, ans=0.0 2023-06-27 17:23:31,327 INFO [train.py:996] (2/4) Epoch 11, batch 4750, loss[loss=0.2044, simple_loss=0.2783, pruned_loss=0.06521, over 21420.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.289, pruned_loss=0.06685, over 4281084.60 frames. ], batch size: 131, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:23:38,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1858176.0, ans=0.2 2023-06-27 17:23:52,927 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.04 vs. limit=22.5 2023-06-27 17:24:56,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1858356.0, ans=0.0 2023-06-27 17:25:20,794 INFO [train.py:996] (2/4) Epoch 11, batch 4800, loss[loss=0.2052, simple_loss=0.3081, pruned_loss=0.05114, over 21807.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2899, pruned_loss=0.06728, over 4281989.05 frames. 
], batch size: 351, lr: 2.70e-03, grad_scale: 32.0 2023-06-27 17:25:54,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1858536.0, ans=0.2 2023-06-27 17:25:58,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1858536.0, ans=0.1 2023-06-27 17:26:10,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1858596.0, ans=0.0 2023-06-27 17:26:19,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1858596.0, ans=0.125 2023-06-27 17:26:28,650 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.577e+02 8.092e+02 1.102e+03 1.736e+03 3.587e+03, threshold=2.204e+03, percent-clipped=14.0 2023-06-27 17:26:53,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1858716.0, ans=0.125 2023-06-27 17:27:03,229 INFO [train.py:996] (2/4) Epoch 11, batch 4850, loss[loss=0.2448, simple_loss=0.3185, pruned_loss=0.08558, over 21524.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2883, pruned_loss=0.06716, over 4283568.50 frames. ], batch size: 471, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:27:07,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1858776.0, ans=0.125 2023-06-27 17:28:08,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1858956.0, ans=0.125 2023-06-27 17:28:08,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1858956.0, ans=0.1 2023-06-27 17:28:10,825 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=15.0 2023-06-27 17:28:18,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1858956.0, ans=0.0 2023-06-27 17:28:21,076 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.50 vs. limit=10.0 2023-06-27 17:28:41,956 INFO [train.py:996] (2/4) Epoch 11, batch 4900, loss[loss=0.2428, simple_loss=0.3072, pruned_loss=0.08919, over 21779.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2914, pruned_loss=0.06835, over 4287669.19 frames. ], batch size: 441, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:29:03,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1859076.0, ans=0.125 2023-06-27 17:29:07,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1859136.0, ans=0.0 2023-06-27 17:29:56,067 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.319e+02 7.657e+02 1.361e+03 1.915e+03 3.497e+03, threshold=2.723e+03, percent-clipped=17.0 2023-06-27 17:30:31,170 INFO [train.py:996] (2/4) Epoch 11, batch 4950, loss[loss=0.1694, simple_loss=0.2712, pruned_loss=0.03381, over 21735.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2945, pruned_loss=0.06684, over 4287873.43 frames. 
], batch size: 332, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:30:36,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1859376.0, ans=0.2 2023-06-27 17:30:44,167 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-27 17:32:14,063 INFO [train.py:996] (2/4) Epoch 11, batch 5000, loss[loss=0.192, simple_loss=0.2534, pruned_loss=0.06526, over 20180.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.293, pruned_loss=0.06446, over 4285278.55 frames. ], batch size: 703, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:32:42,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1859736.0, ans=0.125 2023-06-27 17:33:20,245 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.265e+02 5.938e+02 8.345e+02 1.344e+03 2.733e+03, threshold=1.669e+03, percent-clipped=1.0 2023-06-27 17:33:50,177 INFO [train.py:996] (2/4) Epoch 11, batch 5050, loss[loss=0.2413, simple_loss=0.307, pruned_loss=0.08786, over 21874.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2932, pruned_loss=0.06609, over 4281888.37 frames. ], batch size: 391, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:34:30,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1860036.0, ans=0.125 2023-06-27 17:34:37,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1860096.0, ans=0.2 2023-06-27 17:34:38,096 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.74 vs. limit=15.0 2023-06-27 17:35:30,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1860216.0, ans=0.05 2023-06-27 17:35:33,544 INFO [train.py:996] (2/4) Epoch 11, batch 5100, loss[loss=0.1808, simple_loss=0.2692, pruned_loss=0.04621, over 21856.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2922, pruned_loss=0.06658, over 4278121.12 frames. ], batch size: 371, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:35:50,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1860276.0, ans=0.125 2023-06-27 17:36:12,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1860336.0, ans=0.125 2023-06-27 17:36:47,573 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.357e+02 6.699e+02 8.715e+02 1.182e+03 3.007e+03, threshold=1.743e+03, percent-clipped=11.0 2023-06-27 17:37:23,084 INFO [train.py:996] (2/4) Epoch 11, batch 5150, loss[loss=0.2081, simple_loss=0.2809, pruned_loss=0.0676, over 21889.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2895, pruned_loss=0.06718, over 4284332.90 frames. ], batch size: 316, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:37:54,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1860636.0, ans=0.1 2023-06-27 17:39:12,476 INFO [train.py:996] (2/4) Epoch 11, batch 5200, loss[loss=0.2253, simple_loss=0.3294, pruned_loss=0.06059, over 21746.00 frames. 
], tot_loss[loss=0.2132, simple_loss=0.2914, pruned_loss=0.06749, over 4286448.99 frames. ], batch size: 332, lr: 2.69e-03, grad_scale: 32.0 2023-06-27 17:39:35,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1860936.0, ans=0.125 2023-06-27 17:39:40,550 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.66 vs. limit=15.0 2023-06-27 17:39:58,988 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=22.5 2023-06-27 17:40:17,758 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.467e+02 7.745e+02 1.179e+03 1.665e+03 4.294e+03, threshold=2.357e+03, percent-clipped=21.0 2023-06-27 17:40:37,301 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-27 17:41:00,886 INFO [train.py:996] (2/4) Epoch 11, batch 5250, loss[loss=0.2031, simple_loss=0.2949, pruned_loss=0.05562, over 21588.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.296, pruned_loss=0.06664, over 4283562.22 frames. ], batch size: 263, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:41:04,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1861176.0, ans=0.5 2023-06-27 17:41:17,185 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.01 vs. limit=12.0 2023-06-27 17:41:28,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1861236.0, ans=0.0 2023-06-27 17:41:32,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1861236.0, ans=0.125 2023-06-27 17:41:42,628 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=15.0 2023-06-27 17:41:56,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1861356.0, ans=0.125 2023-06-27 17:42:41,258 INFO [train.py:996] (2/4) Epoch 11, batch 5300, loss[loss=0.2331, simple_loss=0.3093, pruned_loss=0.0785, over 21961.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2939, pruned_loss=0.06688, over 4289247.50 frames. ], batch size: 113, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:42:46,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1861476.0, ans=0.125 2023-06-27 17:43:03,481 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.07 vs. 
limit=15.0 2023-06-27 17:43:39,373 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.786e+02 7.949e+02 1.214e+03 1.979e+03 3.974e+03, threshold=2.428e+03, percent-clipped=14.0 2023-06-27 17:43:39,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1861656.0, ans=0.0 2023-06-27 17:44:04,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1861716.0, ans=0.125 2023-06-27 17:44:21,753 INFO [train.py:996] (2/4) Epoch 11, batch 5350, loss[loss=0.2445, simple_loss=0.3052, pruned_loss=0.09186, over 21736.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.293, pruned_loss=0.06814, over 4287859.24 frames. ], batch size: 473, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:44:29,850 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.04 vs. limit=15.0 2023-06-27 17:44:36,217 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.77 vs. limit=10.0 2023-06-27 17:44:58,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1861896.0, ans=0.125 2023-06-27 17:45:38,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1862016.0, ans=0.1 2023-06-27 17:46:05,872 INFO [train.py:996] (2/4) Epoch 11, batch 5400, loss[loss=0.1996, simple_loss=0.3025, pruned_loss=0.04837, over 20827.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2911, pruned_loss=0.06833, over 4288901.04 frames. ], batch size: 607, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:46:13,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1862076.0, ans=0.125 2023-06-27 17:47:07,284 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.410e+02 6.450e+02 1.066e+03 1.376e+03 3.123e+03, threshold=2.132e+03, percent-clipped=3.0 2023-06-27 17:47:38,302 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.62 vs. limit=10.0 2023-06-27 17:47:47,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1862316.0, ans=0.125 2023-06-27 17:47:50,484 INFO [train.py:996] (2/4) Epoch 11, batch 5450, loss[loss=0.2336, simple_loss=0.3467, pruned_loss=0.06025, over 21636.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2934, pruned_loss=0.06688, over 4286231.46 frames. ], batch size: 389, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:47:59,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1862376.0, ans=0.125 2023-06-27 17:49:40,241 INFO [train.py:996] (2/4) Epoch 11, batch 5500, loss[loss=0.2091, simple_loss=0.309, pruned_loss=0.05457, over 21757.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2991, pruned_loss=0.0652, over 4278652.82 frames. ], batch size: 332, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:49:53,354 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.90 vs. 
limit=22.5 2023-06-27 17:50:43,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1862856.0, ans=0.0 2023-06-27 17:50:49,748 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.051e+02 7.572e+02 1.528e+03 2.313e+03 5.179e+03, threshold=3.055e+03, percent-clipped=29.0 2023-06-27 17:51:02,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1862916.0, ans=0.125 2023-06-27 17:51:08,485 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0 2023-06-27 17:51:24,525 INFO [train.py:996] (2/4) Epoch 11, batch 5550, loss[loss=0.1592, simple_loss=0.237, pruned_loss=0.04069, over 21197.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2984, pruned_loss=0.06278, over 4270621.83 frames. ], batch size: 159, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:51:39,321 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-27 17:52:00,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1863036.0, ans=0.125 2023-06-27 17:52:15,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1863096.0, ans=0.125 2023-06-27 17:52:41,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1863156.0, ans=0.0 2023-06-27 17:52:55,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1863216.0, ans=0.0 2023-06-27 17:53:00,761 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-06-27 17:53:04,454 INFO [train.py:996] (2/4) Epoch 11, batch 5600, loss[loss=0.2097, simple_loss=0.3, pruned_loss=0.05972, over 21404.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2958, pruned_loss=0.05934, over 4266342.35 frames. ], batch size: 194, lr: 2.69e-03, grad_scale: 32.0 2023-06-27 17:53:15,905 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.34 vs. 
limit=22.5 2023-06-27 17:53:37,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1863336.0, ans=0.0 2023-06-27 17:53:42,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1863336.0, ans=0.0 2023-06-27 17:54:06,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1863456.0, ans=0.1 2023-06-27 17:54:13,841 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.087e+02 7.294e+02 1.095e+03 1.659e+03 3.151e+03, threshold=2.190e+03, percent-clipped=1.0 2023-06-27 17:54:22,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1863456.0, ans=0.2 2023-06-27 17:54:22,617 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 17:54:22,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1863456.0, ans=0.07 2023-06-27 17:54:41,753 INFO [train.py:996] (2/4) Epoch 11, batch 5650, loss[loss=0.2133, simple_loss=0.2861, pruned_loss=0.07022, over 21878.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.299, pruned_loss=0.06171, over 4269749.45 frames. ], batch size: 124, lr: 2.69e-03, grad_scale: 32.0 2023-06-27 17:55:05,743 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.02 vs. limit=10.0 2023-06-27 17:56:19,729 INFO [train.py:996] (2/4) Epoch 11, batch 5700, loss[loss=0.1877, simple_loss=0.2651, pruned_loss=0.05516, over 21123.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2975, pruned_loss=0.06369, over 4275826.83 frames. ], batch size: 608, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:56:34,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1863876.0, ans=0.1 2023-06-27 17:57:32,517 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.653e+02 6.609e+02 9.381e+02 1.350e+03 3.463e+03, threshold=1.876e+03, percent-clipped=9.0 2023-06-27 17:57:41,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1864116.0, ans=0.035 2023-06-27 17:57:45,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1864116.0, ans=0.0 2023-06-27 17:58:13,666 INFO [train.py:996] (2/4) Epoch 11, batch 5750, loss[loss=0.2314, simple_loss=0.3191, pruned_loss=0.07191, over 21504.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2944, pruned_loss=0.0619, over 4270014.57 frames. ], batch size: 508, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:58:31,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1864236.0, ans=0.95 2023-06-27 17:58:31,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1864236.0, ans=0.2 2023-06-27 17:59:48,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1864416.0, ans=0.0 2023-06-27 17:59:56,960 INFO [train.py:996] (2/4) Epoch 11, batch 5800, loss[loss=0.257, simple_loss=0.3471, pruned_loss=0.08347, over 21533.00 frames. 
], tot_loss[loss=0.2087, simple_loss=0.2958, pruned_loss=0.06079, over 4266913.21 frames. ], batch size: 471, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:01:04,529 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.579e+02 7.155e+02 1.088e+03 1.847e+03 4.141e+03, threshold=2.176e+03, percent-clipped=25.0 2023-06-27 18:01:16,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1864656.0, ans=0.1 2023-06-27 18:01:31,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1864716.0, ans=0.125 2023-06-27 18:01:41,152 INFO [train.py:996] (2/4) Epoch 11, batch 5850, loss[loss=0.1865, simple_loss=0.2964, pruned_loss=0.0383, over 21635.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2924, pruned_loss=0.05721, over 4263476.23 frames. ], batch size: 414, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:02:21,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1864896.0, ans=0.125 2023-06-27 18:02:44,914 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.10 vs. limit=6.0 2023-06-27 18:02:45,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1864956.0, ans=0.0 2023-06-27 18:03:17,828 INFO [train.py:996] (2/4) Epoch 11, batch 5900, loss[loss=0.2167, simple_loss=0.2848, pruned_loss=0.07427, over 21427.00 frames. ], tot_loss[loss=0.1953, simple_loss=0.2848, pruned_loss=0.0529, over 4264020.91 frames. ], batch size: 131, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:03:18,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1865076.0, ans=0.125 2023-06-27 18:03:34,657 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 18:04:28,082 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.367e+02 6.471e+02 9.679e+02 1.352e+03 2.438e+03, threshold=1.936e+03, percent-clipped=4.0 2023-06-27 18:04:54,761 INFO [train.py:996] (2/4) Epoch 11, batch 5950, loss[loss=0.212, simple_loss=0.2774, pruned_loss=0.07328, over 21686.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.2832, pruned_loss=0.05581, over 4271075.79 frames. ], batch size: 414, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:05:28,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1865436.0, ans=0.0 2023-06-27 18:06:06,481 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 18:06:11,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1865556.0, ans=0.125 2023-06-27 18:06:11,688 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.12 vs. limit=15.0 2023-06-27 18:06:37,190 INFO [train.py:996] (2/4) Epoch 11, batch 6000, loss[loss=0.1953, simple_loss=0.2573, pruned_loss=0.06664, over 21251.00 frames. ], tot_loss[loss=0.1978, simple_loss=0.2786, pruned_loss=0.05857, over 4272773.58 frames. 
], batch size: 159, lr: 2.69e-03, grad_scale: 32.0 2023-06-27 18:06:37,191 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-27 18:06:50,041 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([5.3644, 5.5038, 5.2410, 5.0209], device='cuda:2') 2023-06-27 18:06:52,618 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.5614, 4.0910, 4.2421, 3.7505], device='cuda:2') 2023-06-27 18:06:56,349 INFO [train.py:1028] (2/4) Epoch 11, validation: loss=0.2612, simple_loss=0.354, pruned_loss=0.08419, over 1796401.00 frames. 2023-06-27 18:06:56,350 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23731MB 2023-06-27 18:07:14,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1865736.0, ans=0.125 2023-06-27 18:08:10,064 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.353e+02 5.907e+02 8.109e+02 1.325e+03 2.971e+03, threshold=1.622e+03, percent-clipped=7.0 2023-06-27 18:08:39,959 INFO [train.py:996] (2/4) Epoch 11, batch 6050, loss[loss=0.2026, simple_loss=0.2782, pruned_loss=0.06347, over 15999.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2727, pruned_loss=0.05869, over 4273841.71 frames. ], batch size: 60, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:10:02,423 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.65 vs. limit=22.5 2023-06-27 18:10:17,460 INFO [train.py:996] (2/4) Epoch 11, batch 6100, loss[loss=0.2081, simple_loss=0.28, pruned_loss=0.06805, over 21534.00 frames. ], tot_loss[loss=0.1946, simple_loss=0.2731, pruned_loss=0.058, over 4275216.85 frames. ], batch size: 548, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:11:20,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1866456.0, ans=0.0 2023-06-27 18:11:29,682 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.047e+02 7.065e+02 1.029e+03 1.365e+03 3.489e+03, threshold=2.059e+03, percent-clipped=16.0 2023-06-27 18:11:59,720 INFO [train.py:996] (2/4) Epoch 11, batch 6150, loss[loss=0.1957, simple_loss=0.2718, pruned_loss=0.05979, over 21906.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2765, pruned_loss=0.0601, over 4283309.45 frames. ], batch size: 98, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:13:38,547 INFO [train.py:996] (2/4) Epoch 11, batch 6200, loss[loss=0.267, simple_loss=0.3672, pruned_loss=0.08336, over 21769.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2793, pruned_loss=0.06056, over 4277232.70 frames. 
], batch size: 391, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:14:49,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1867056.0, ans=0.1 2023-06-27 18:14:52,452 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.390e+02 7.354e+02 1.075e+03 1.607e+03 4.153e+03, threshold=2.150e+03, percent-clipped=10.0 2023-06-27 18:14:53,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1867056.0, ans=0.0 2023-06-27 18:15:05,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1867116.0, ans=0.1 2023-06-27 18:15:12,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1867116.0, ans=0.0 2023-06-27 18:15:18,549 INFO [train.py:996] (2/4) Epoch 11, batch 6250, loss[loss=0.199, simple_loss=0.2888, pruned_loss=0.0546, over 21277.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2831, pruned_loss=0.0589, over 4269112.09 frames. ], batch size: 159, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:15:50,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1867236.0, ans=0.125 2023-06-27 18:16:15,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1867296.0, ans=0.0 2023-06-27 18:16:22,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1867356.0, ans=15.0 2023-06-27 18:17:10,376 INFO [train.py:996] (2/4) Epoch 11, batch 6300, loss[loss=0.2133, simple_loss=0.3009, pruned_loss=0.06282, over 21873.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2875, pruned_loss=0.05909, over 4275386.95 frames. ], batch size: 351, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:17:34,237 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.39 vs. limit=22.5 2023-06-27 18:17:34,318 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.39 vs. limit=10.0 2023-06-27 18:18:04,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1867596.0, ans=0.125 2023-06-27 18:18:17,774 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.254e+02 6.166e+02 8.295e+02 1.136e+03 2.739e+03, threshold=1.659e+03, percent-clipped=3.0 2023-06-27 18:18:25,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1867716.0, ans=0.1 2023-06-27 18:18:52,480 INFO [train.py:996] (2/4) Epoch 11, batch 6350, loss[loss=0.2115, simple_loss=0.2908, pruned_loss=0.06611, over 21434.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2899, pruned_loss=0.06265, over 4282806.73 frames. ], batch size: 211, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:19:18,539 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.89 vs. 
limit=22.5 2023-06-27 18:19:39,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1867896.0, ans=0.1 2023-06-27 18:20:25,258 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.53 vs. limit=15.0 2023-06-27 18:20:30,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1868016.0, ans=0.125 2023-06-27 18:20:40,586 INFO [train.py:996] (2/4) Epoch 11, batch 6400, loss[loss=0.2157, simple_loss=0.2923, pruned_loss=0.06955, over 21627.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2957, pruned_loss=0.06644, over 4283593.11 frames. ], batch size: 263, lr: 2.69e-03, grad_scale: 32.0 2023-06-27 18:20:45,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1868076.0, ans=0.0 2023-06-27 18:21:18,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1868196.0, ans=0.2 2023-06-27 18:21:45,658 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.12 vs. limit=10.0 2023-06-27 18:21:55,776 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.666e+02 7.590e+02 1.060e+03 1.570e+03 3.138e+03, threshold=2.120e+03, percent-clipped=19.0 2023-06-27 18:22:09,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1868316.0, ans=0.2 2023-06-27 18:22:12,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1868316.0, ans=0.125 2023-06-27 18:22:15,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1868316.0, ans=0.125 2023-06-27 18:22:23,552 INFO [train.py:996] (2/4) Epoch 11, batch 6450, loss[loss=0.2103, simple_loss=0.3025, pruned_loss=0.05906, over 21577.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.3003, pruned_loss=0.06674, over 4283802.11 frames. ], batch size: 389, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:22:51,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1868436.0, ans=0.125 2023-06-27 18:22:57,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1868436.0, ans=0.125 2023-06-27 18:23:55,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1868616.0, ans=0.0 2023-06-27 18:24:06,995 INFO [train.py:996] (2/4) Epoch 11, batch 6500, loss[loss=0.2368, simple_loss=0.2929, pruned_loss=0.09032, over 21347.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.295, pruned_loss=0.0659, over 4283882.25 frames. ], batch size: 507, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:24:25,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1868736.0, ans=0.125 2023-06-27 18:25:20,773 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.30 vs. 
limit=10.0 2023-06-27 18:25:20,918 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.722e+02 7.121e+02 1.016e+03 1.758e+03 3.430e+03, threshold=2.032e+03, percent-clipped=12.0 2023-06-27 18:25:21,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1868856.0, ans=0.2 2023-06-27 18:25:48,830 INFO [train.py:996] (2/4) Epoch 11, batch 6550, loss[loss=0.193, simple_loss=0.2736, pruned_loss=0.05618, over 21775.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2945, pruned_loss=0.06501, over 4285202.10 frames. ], batch size: 247, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:26:03,106 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=12.0 2023-06-27 18:26:39,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1869096.0, ans=0.125 2023-06-27 18:26:40,122 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-06-27 18:27:31,156 INFO [train.py:996] (2/4) Epoch 11, batch 6600, loss[loss=0.1918, simple_loss=0.2576, pruned_loss=0.06297, over 21757.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2877, pruned_loss=0.06456, over 4265984.22 frames. ], batch size: 300, lr: 2.69e-03, grad_scale: 8.0 2023-06-27 18:27:55,926 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0 2023-06-27 18:28:24,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1869396.0, ans=0.2 2023-06-27 18:28:28,335 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 18:28:50,866 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.168e+02 6.592e+02 1.007e+03 1.403e+03 3.039e+03, threshold=2.014e+03, percent-clipped=10.0 2023-06-27 18:29:06,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1869516.0, ans=0.0 2023-06-27 18:29:11,028 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.79 vs. limit=22.5 2023-06-27 18:29:12,963 INFO [train.py:996] (2/4) Epoch 11, batch 6650, loss[loss=0.2037, simple_loss=0.2756, pruned_loss=0.06589, over 21557.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2805, pruned_loss=0.06225, over 4266074.61 frames. ], batch size: 442, lr: 2.69e-03, grad_scale: 8.0 2023-06-27 18:29:29,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1869576.0, ans=0.0 2023-06-27 18:30:59,844 INFO [train.py:996] (2/4) Epoch 11, batch 6700, loss[loss=0.1936, simple_loss=0.2667, pruned_loss=0.06027, over 21655.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2751, pruned_loss=0.0616, over 4257565.43 frames. ], batch size: 247, lr: 2.69e-03, grad_scale: 8.0 2023-06-27 18:31:29,158 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.09 vs. 
limit=10.0 2023-06-27 18:32:16,559 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.214e+02 6.879e+02 9.707e+02 1.410e+03 2.811e+03, threshold=1.941e+03, percent-clipped=3.0 2023-06-27 18:32:23,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1870116.0, ans=0.125 2023-06-27 18:32:42,377 INFO [train.py:996] (2/4) Epoch 11, batch 6750, loss[loss=0.2037, simple_loss=0.2749, pruned_loss=0.06623, over 21820.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2737, pruned_loss=0.06151, over 4258724.26 frames. ], batch size: 351, lr: 2.69e-03, grad_scale: 8.0 2023-06-27 18:32:50,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1870176.0, ans=0.125 2023-06-27 18:33:40,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1870356.0, ans=0.1 2023-06-27 18:34:14,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1870416.0, ans=0.125 2023-06-27 18:34:23,484 INFO [train.py:996] (2/4) Epoch 11, batch 6800, loss[loss=0.2187, simple_loss=0.2793, pruned_loss=0.07911, over 21578.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2753, pruned_loss=0.06344, over 4268640.07 frames. ], batch size: 414, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:34:30,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1870476.0, ans=0.0 2023-06-27 18:34:55,739 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=15.0 2023-06-27 18:35:30,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1870656.0, ans=0.125 2023-06-27 18:35:33,806 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.21 vs. limit=15.0 2023-06-27 18:35:39,117 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.574e+02 7.159e+02 9.186e+02 1.470e+03 3.415e+03, threshold=1.837e+03, percent-clipped=10.0 2023-06-27 18:35:40,361 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=15.0 2023-06-27 18:35:46,833 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.98 vs. limit=10.0 2023-06-27 18:36:00,271 INFO [train.py:996] (2/4) Epoch 11, batch 6850, loss[loss=0.1921, simple_loss=0.2581, pruned_loss=0.06307, over 21216.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2748, pruned_loss=0.064, over 4264863.97 frames. ], batch size: 607, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:36:45,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1870896.0, ans=0.125 2023-06-27 18:37:43,674 INFO [train.py:996] (2/4) Epoch 11, batch 6900, loss[loss=0.2144, simple_loss=0.2764, pruned_loss=0.07618, over 21534.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2758, pruned_loss=0.06465, over 4272807.39 frames. 
], batch size: 548, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:38:21,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1871136.0, ans=0.125 2023-06-27 18:38:40,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1871196.0, ans=0.0 2023-06-27 18:39:05,833 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.235e+02 7.048e+02 1.193e+03 1.711e+03 4.903e+03, threshold=2.385e+03, percent-clipped=22.0 2023-06-27 18:39:31,790 INFO [train.py:996] (2/4) Epoch 11, batch 6950, loss[loss=0.1823, simple_loss=0.2969, pruned_loss=0.0338, over 21279.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.28, pruned_loss=0.06289, over 4278655.72 frames. ], batch size: 548, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:39:37,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1871376.0, ans=0.125 2023-06-27 18:39:49,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1871376.0, ans=0.1 2023-06-27 18:39:54,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1871436.0, ans=0.0 2023-06-27 18:39:55,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=1871436.0, ans=0.1 2023-06-27 18:39:55,958 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 18:39:59,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1871436.0, ans=0.125 2023-06-27 18:40:18,347 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-27 18:41:02,524 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-27 18:41:14,921 INFO [train.py:996] (2/4) Epoch 11, batch 7000, loss[loss=0.2381, simple_loss=0.3004, pruned_loss=0.08786, over 21698.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2815, pruned_loss=0.06468, over 4284256.27 frames. ], batch size: 441, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:41:37,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1871736.0, ans=0.125 2023-06-27 18:41:40,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1871736.0, ans=0.2 2023-06-27 18:42:31,980 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.526e+02 6.963e+02 9.301e+02 1.305e+03 2.856e+03, threshold=1.860e+03, percent-clipped=1.0 2023-06-27 18:42:42,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1871916.0, ans=0.2 2023-06-27 18:42:58,610 INFO [train.py:996] (2/4) Epoch 11, batch 7050, loss[loss=0.1679, simple_loss=0.2319, pruned_loss=0.05195, over 15947.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2786, pruned_loss=0.06347, over 4275880.41 frames. 
], batch size: 60, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:43:28,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1872036.0, ans=0.0 2023-06-27 18:43:30,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1872036.0, ans=0.125 2023-06-27 18:43:54,257 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=12.0 2023-06-27 18:44:03,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1872156.0, ans=0.0 2023-06-27 18:44:36,433 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.69 vs. limit=10.0 2023-06-27 18:44:47,744 INFO [train.py:996] (2/4) Epoch 11, batch 7100, loss[loss=0.2045, simple_loss=0.2815, pruned_loss=0.06373, over 20727.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.282, pruned_loss=0.06408, over 4277099.33 frames. ], batch size: 607, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:45:29,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1872396.0, ans=0.125 2023-06-27 18:45:30,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1872396.0, ans=0.125 2023-06-27 18:45:54,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1872456.0, ans=0.0 2023-06-27 18:46:03,439 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.854e+02 6.087e+02 7.876e+02 1.187e+03 3.248e+03, threshold=1.575e+03, percent-clipped=9.0 2023-06-27 18:46:30,036 INFO [train.py:996] (2/4) Epoch 11, batch 7150, loss[loss=0.2319, simple_loss=0.3121, pruned_loss=0.07586, over 21659.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2819, pruned_loss=0.06378, over 4275021.43 frames. ], batch size: 351, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:46:45,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1872576.0, ans=0.0 2023-06-27 18:47:16,101 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 18:47:21,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1872696.0, ans=0.1 2023-06-27 18:47:21,753 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.72 vs. limit=10.0 2023-06-27 18:47:31,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1872696.0, ans=0.0 2023-06-27 18:48:18,348 INFO [train.py:996] (2/4) Epoch 11, batch 7200, loss[loss=0.1962, simple_loss=0.2664, pruned_loss=0.06301, over 21652.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2841, pruned_loss=0.0663, over 4282658.79 frames. 
], batch size: 298, lr: 2.69e-03, grad_scale: 32.0 2023-06-27 18:49:35,426 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.391e+02 8.685e+02 1.394e+03 1.830e+03 3.525e+03, threshold=2.788e+03, percent-clipped=36.0 2023-06-27 18:50:04,641 INFO [train.py:996] (2/4) Epoch 11, batch 7250, loss[loss=0.188, simple_loss=0.2484, pruned_loss=0.06382, over 21423.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2812, pruned_loss=0.06523, over 4283818.50 frames. ], batch size: 195, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:50:18,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1873176.0, ans=0.1 2023-06-27 18:50:48,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1873296.0, ans=0.125 2023-06-27 18:50:53,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1873296.0, ans=0.125 2023-06-27 18:51:05,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1873356.0, ans=0.0 2023-06-27 18:51:11,717 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 18:51:47,392 INFO [train.py:996] (2/4) Epoch 11, batch 7300, loss[loss=0.1981, simple_loss=0.2659, pruned_loss=0.06509, over 21829.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2756, pruned_loss=0.06413, over 4285552.44 frames. ], batch size: 107, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:52:08,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1873536.0, ans=0.0 2023-06-27 18:52:15,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1873536.0, ans=0.07 2023-06-27 18:52:17,600 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-06-27 18:52:33,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1873596.0, ans=0.125 2023-06-27 18:53:00,323 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.280e+02 7.298e+02 1.227e+03 1.780e+03 3.301e+03, threshold=2.454e+03, percent-clipped=5.0 2023-06-27 18:53:30,266 INFO [train.py:996] (2/4) Epoch 11, batch 7350, loss[loss=0.2658, simple_loss=0.3414, pruned_loss=0.09509, over 21745.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2739, pruned_loss=0.0643, over 4274793.38 frames. 
], batch size: 124, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:53:37,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1873776.0, ans=0.125 2023-06-27 18:53:39,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1873776.0, ans=0.125 2023-06-27 18:53:47,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1873836.0, ans=0.1 2023-06-27 18:53:52,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1873836.0, ans=0.95 2023-06-27 18:54:10,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1873896.0, ans=15.0 2023-06-27 18:54:27,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1873896.0, ans=0.125 2023-06-27 18:55:13,751 INFO [train.py:996] (2/4) Epoch 11, batch 7400, loss[loss=0.1953, simple_loss=0.2906, pruned_loss=0.04994, over 21725.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2784, pruned_loss=0.06576, over 4274783.78 frames. ], batch size: 332, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:55:29,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1874136.0, ans=0.125 2023-06-27 18:55:47,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1874136.0, ans=0.125 2023-06-27 18:56:22,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1874256.0, ans=0.2 2023-06-27 18:56:28,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1874256.0, ans=0.125 2023-06-27 18:56:31,615 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.427e+02 7.089e+02 1.051e+03 1.718e+03 3.603e+03, threshold=2.102e+03, percent-clipped=3.0 2023-06-27 18:56:54,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1874316.0, ans=0.1 2023-06-27 18:56:57,308 INFO [train.py:996] (2/4) Epoch 11, batch 7450, loss[loss=0.1911, simple_loss=0.2628, pruned_loss=0.05972, over 21570.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2767, pruned_loss=0.06475, over 4273135.83 frames. ], batch size: 247, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:57:04,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1874376.0, ans=0.0 2023-06-27 18:57:14,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1874436.0, ans=0.07 2023-06-27 18:57:27,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1874436.0, ans=0.0 2023-06-27 18:57:40,012 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.29 vs. 
limit=12.0 2023-06-27 18:58:06,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1874556.0, ans=0.125 2023-06-27 18:58:41,428 INFO [train.py:996] (2/4) Epoch 11, batch 7500, loss[loss=0.1865, simple_loss=0.2421, pruned_loss=0.06538, over 20870.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2812, pruned_loss=0.06651, over 4273866.31 frames. ], batch size: 608, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 18:58:55,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1874676.0, ans=0.95 2023-06-27 18:58:57,878 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.52 vs. limit=15.0 2023-06-27 18:59:15,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1874736.0, ans=0.2 2023-06-27 18:59:18,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1874736.0, ans=0.125 2023-06-27 18:59:25,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1874796.0, ans=0.0 2023-06-27 19:00:03,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1874856.0, ans=0.0 2023-06-27 19:00:04,712 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.389e+02 7.977e+02 1.325e+03 1.991e+03 3.400e+03, threshold=2.650e+03, percent-clipped=21.0 2023-06-27 19:00:24,559 INFO [train.py:996] (2/4) Epoch 11, batch 7550, loss[loss=0.1854, simple_loss=0.287, pruned_loss=0.04196, over 21690.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2896, pruned_loss=0.0663, over 4274758.69 frames. ], batch size: 298, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:00:25,661 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-06-27 19:00:30,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1874976.0, ans=0.2 2023-06-27 19:00:30,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1874976.0, ans=0.125 2023-06-27 19:01:08,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1875096.0, ans=0.2 2023-06-27 19:01:12,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1875096.0, ans=0.1 2023-06-27 19:01:48,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1875216.0, ans=0.125 2023-06-27 19:01:54,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1875216.0, ans=0.1 2023-06-27 19:01:55,547 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-27 19:02:05,494 INFO [train.py:996] (2/4) Epoch 11, batch 7600, loss[loss=0.2208, simple_loss=0.3005, pruned_loss=0.07061, over 21758.00 frames. 
], tot_loss[loss=0.2088, simple_loss=0.2883, pruned_loss=0.0646, over 4278253.44 frames. ], batch size: 112, lr: 2.68e-03, grad_scale: 32.0 2023-06-27 19:02:17,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1875276.0, ans=0.04949747468305833 2023-06-27 19:02:32,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1875336.0, ans=0.1 2023-06-27 19:02:43,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1875336.0, ans=0.0 2023-06-27 19:02:45,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1875336.0, ans=0.125 2023-06-27 19:03:28,849 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.792e+02 7.250e+02 9.858e+02 1.337e+03 3.374e+03, threshold=1.972e+03, percent-clipped=5.0 2023-06-27 19:03:47,207 INFO [train.py:996] (2/4) Epoch 11, batch 7650, loss[loss=0.2068, simple_loss=0.2768, pruned_loss=0.06841, over 21952.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2881, pruned_loss=0.06654, over 4279542.10 frames. ], batch size: 316, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:03:58,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1875576.0, ans=0.125 2023-06-27 19:04:13,385 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.32 vs. limit=15.0 2023-06-27 19:04:46,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1875696.0, ans=0.2 2023-06-27 19:05:13,794 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-27 19:05:16,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1875816.0, ans=0.125 2023-06-27 19:05:30,771 INFO [train.py:996] (2/4) Epoch 11, batch 7700, loss[loss=0.1829, simple_loss=0.2383, pruned_loss=0.06373, over 20819.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.291, pruned_loss=0.06907, over 4281233.37 frames. ], batch size: 609, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:05:33,873 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.42 vs. 
limit=15.0 2023-06-27 19:06:09,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1875936.0, ans=0.025 2023-06-27 19:06:41,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1876056.0, ans=0.125 2023-06-27 19:06:51,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1876056.0, ans=0.0 2023-06-27 19:06:53,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1876056.0, ans=0.125 2023-06-27 19:06:59,793 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.712e+02 8.160e+02 1.175e+03 1.754e+03 4.757e+03, threshold=2.350e+03, percent-clipped=23.0 2023-06-27 19:07:16,831 INFO [train.py:996] (2/4) Epoch 11, batch 7750, loss[loss=0.1977, simple_loss=0.288, pruned_loss=0.05369, over 20705.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2966, pruned_loss=0.06943, over 4279498.84 frames. ], batch size: 607, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:08:13,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1876296.0, ans=15.0 2023-06-27 19:08:40,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1876356.0, ans=0.2 2023-06-27 19:09:09,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1876476.0, ans=0.0 2023-06-27 19:09:10,458 INFO [train.py:996] (2/4) Epoch 11, batch 7800, loss[loss=0.1837, simple_loss=0.2161, pruned_loss=0.07567, over 16599.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2983, pruned_loss=0.06982, over 4272630.27 frames. ], batch size: 60, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:09:14,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1876476.0, ans=0.125 2023-06-27 19:10:24,636 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.82 vs. limit=6.0 2023-06-27 19:10:26,680 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.510e+02 6.767e+02 1.181e+03 1.586e+03 4.451e+03, threshold=2.363e+03, percent-clipped=7.0 2023-06-27 19:10:44,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1876716.0, ans=0.125 2023-06-27 19:10:53,763 INFO [train.py:996] (2/4) Epoch 11, batch 7850, loss[loss=0.1883, simple_loss=0.2472, pruned_loss=0.06472, over 21150.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2915, pruned_loss=0.06931, over 4269675.29 frames. ], batch size: 176, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:11:56,099 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 19:12:40,432 INFO [train.py:996] (2/4) Epoch 11, batch 7900, loss[loss=0.1742, simple_loss=0.2397, pruned_loss=0.05433, over 21422.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.287, pruned_loss=0.06867, over 4264332.31 frames. 
], batch size: 131, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:14:08,126 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.410e+02 7.562e+02 1.142e+03 1.795e+03 4.843e+03, threshold=2.283e+03, percent-clipped=15.0 2023-06-27 19:14:08,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1877316.0, ans=0.125 2023-06-27 19:14:29,968 INFO [train.py:996] (2/4) Epoch 11, batch 7950, loss[loss=0.2141, simple_loss=0.3022, pruned_loss=0.06295, over 21784.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2906, pruned_loss=0.06706, over 4258552.41 frames. ], batch size: 332, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:14:59,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1877436.0, ans=0.125 2023-06-27 19:15:01,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1877436.0, ans=0.125 2023-06-27 19:15:01,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1877436.0, ans=10.0 2023-06-27 19:15:17,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1877496.0, ans=0.125 2023-06-27 19:16:05,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1877616.0, ans=0.04949747468305833 2023-06-27 19:16:22,063 INFO [train.py:996] (2/4) Epoch 11, batch 8000, loss[loss=0.3023, simple_loss=0.3772, pruned_loss=0.1138, over 21371.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2945, pruned_loss=0.06865, over 4261055.85 frames. ], batch size: 507, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:16:49,776 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=15.0 2023-06-27 19:17:04,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1877736.0, ans=0.125 2023-06-27 19:17:07,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1877796.0, ans=0.125 2023-06-27 19:17:35,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1877856.0, ans=10.0 2023-06-27 19:17:48,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1877856.0, ans=0.0 2023-06-27 19:17:51,524 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.031e+02 6.364e+02 9.395e+02 1.417e+03 3.378e+03, threshold=1.879e+03, percent-clipped=5.0 2023-06-27 19:17:52,922 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=12.0 2023-06-27 19:18:08,686 INFO [train.py:996] (2/4) Epoch 11, batch 8050, loss[loss=0.3296, simple_loss=0.4055, pruned_loss=0.1268, over 21445.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.3011, pruned_loss=0.07053, over 4265277.49 frames. ], batch size: 507, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:18:11,566 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.15 vs. 
limit=12.0 2023-06-27 19:18:50,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1878096.0, ans=0.95 2023-06-27 19:19:18,224 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.94 vs. limit=22.5 2023-06-27 19:19:53,006 INFO [train.py:996] (2/4) Epoch 11, batch 8100, loss[loss=0.2158, simple_loss=0.2788, pruned_loss=0.07637, over 21361.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3012, pruned_loss=0.07093, over 4275942.99 frames. ], batch size: 159, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:20:29,624 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-06-27 19:21:17,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1878456.0, ans=0.0 2023-06-27 19:21:22,434 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.025e+02 8.290e+02 1.329e+03 2.139e+03 5.514e+03, threshold=2.658e+03, percent-clipped=35.0 2023-06-27 19:21:26,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1878516.0, ans=0.2 2023-06-27 19:21:45,162 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=15.0 2023-06-27 19:21:47,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1878576.0, ans=0.0 2023-06-27 19:21:48,885 INFO [train.py:996] (2/4) Epoch 11, batch 8150, loss[loss=0.1757, simple_loss=0.2458, pruned_loss=0.05278, over 21276.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3057, pruned_loss=0.07138, over 4275911.44 frames. ], batch size: 159, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:21:52,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1878576.0, ans=0.2 2023-06-27 19:22:50,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1878756.0, ans=0.0 2023-06-27 19:23:31,192 INFO [train.py:996] (2/4) Epoch 11, batch 8200, loss[loss=0.2059, simple_loss=0.2692, pruned_loss=0.07131, over 21583.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2984, pruned_loss=0.06919, over 4273176.11 frames. ], batch size: 415, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:24:15,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1878996.0, ans=0.125 2023-06-27 19:24:27,404 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 19:24:53,468 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.441e+02 7.151e+02 1.119e+03 1.525e+03 4.860e+03, threshold=2.239e+03, percent-clipped=3.0 2023-06-27 19:25:09,479 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.91 vs. limit=12.0 2023-06-27 19:25:15,174 INFO [train.py:996] (2/4) Epoch 11, batch 8250, loss[loss=0.3071, simple_loss=0.3707, pruned_loss=0.1217, over 21484.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2974, pruned_loss=0.06953, over 4271806.36 frames. 
], batch size: 508, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:25:53,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1879236.0, ans=0.125 2023-06-27 19:26:09,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1879296.0, ans=0.125 2023-06-27 19:26:21,983 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-27 19:26:44,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1879416.0, ans=0.125 2023-06-27 19:26:50,338 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=12.0 2023-06-27 19:26:59,255 INFO [train.py:996] (2/4) Epoch 11, batch 8300, loss[loss=0.2347, simple_loss=0.3225, pruned_loss=0.07343, over 21617.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2963, pruned_loss=0.0669, over 4260550.49 frames. ], batch size: 389, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:27:19,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1879536.0, ans=0.0 2023-06-27 19:27:40,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1879596.0, ans=0.125 2023-06-27 19:27:58,594 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.89 vs. limit=10.0 2023-06-27 19:28:25,641 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.537e+02 6.833e+02 1.058e+03 1.562e+03 3.226e+03, threshold=2.116e+03, percent-clipped=10.0 2023-06-27 19:28:26,831 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-27 19:28:41,969 INFO [train.py:996] (2/4) Epoch 11, batch 8350, loss[loss=0.2034, simple_loss=0.2873, pruned_loss=0.05977, over 19896.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.296, pruned_loss=0.06545, over 4266097.54 frames. 
], batch size: 703, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:29:00,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1879776.0, ans=0.1 2023-06-27 19:29:12,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1879836.0, ans=0.125 2023-06-27 19:29:15,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1879836.0, ans=0.125 2023-06-27 19:29:32,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1879896.0, ans=0.0 2023-06-27 19:30:04,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1879956.0, ans=0.125 2023-06-27 19:30:10,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1880016.0, ans=0.1 2023-06-27 19:30:14,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1880016.0, ans=0.0 2023-06-27 19:30:29,634 INFO [train.py:996] (2/4) Epoch 11, batch 8400, loss[loss=0.2178, simple_loss=0.3135, pruned_loss=0.0611, over 21615.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2933, pruned_loss=0.06327, over 4260913.95 frames. ], batch size: 441, lr: 2.68e-03, grad_scale: 32.0 2023-06-27 19:30:38,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1880076.0, ans=10.0 2023-06-27 19:31:21,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1880196.0, ans=0.125 2023-06-27 19:31:33,146 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.44 vs. limit=22.5 2023-06-27 19:31:51,129 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.798e+02 6.790e+02 1.029e+03 1.707e+03 4.211e+03, threshold=2.059e+03, percent-clipped=16.0 2023-06-27 19:31:52,423 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=12.0 2023-06-27 19:32:11,261 INFO [train.py:996] (2/4) Epoch 11, batch 8450, loss[loss=0.177, simple_loss=0.2624, pruned_loss=0.04578, over 21510.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2909, pruned_loss=0.06271, over 4265988.53 frames. ], batch size: 212, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:32:29,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1880376.0, ans=0.125 2023-06-27 19:32:41,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1880436.0, ans=0.07 2023-06-27 19:32:54,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1880496.0, ans=0.1 2023-06-27 19:33:48,577 INFO [train.py:996] (2/4) Epoch 11, batch 8500, loss[loss=0.2113, simple_loss=0.2791, pruned_loss=0.07179, over 21686.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2858, pruned_loss=0.0635, over 4261937.26 frames. 
], batch size: 247, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:34:40,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1880796.0, ans=0.1 2023-06-27 19:34:45,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1880796.0, ans=0.2 2023-06-27 19:35:06,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1880856.0, ans=0.2 2023-06-27 19:35:17,400 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.145e+02 8.155e+02 1.098e+03 1.780e+03 3.950e+03, threshold=2.195e+03, percent-clipped=18.0 2023-06-27 19:35:37,556 INFO [train.py:996] (2/4) Epoch 11, batch 8550, loss[loss=0.2064, simple_loss=0.2972, pruned_loss=0.05785, over 21748.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2915, pruned_loss=0.06699, over 4265536.35 frames. ], batch size: 247, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:36:00,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1881036.0, ans=0.125 2023-06-27 19:36:04,693 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=12.0 2023-06-27 19:36:39,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1881156.0, ans=0.0 2023-06-27 19:37:07,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1881216.0, ans=0.125 2023-06-27 19:37:27,693 INFO [train.py:996] (2/4) Epoch 11, batch 8600, loss[loss=0.2685, simple_loss=0.3458, pruned_loss=0.09562, over 21361.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2997, pruned_loss=0.06923, over 4268382.04 frames. ], batch size: 176, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:38:20,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1881396.0, ans=0.0 2023-06-27 19:38:50,814 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.911e+02 7.000e+02 1.009e+03 1.607e+03 3.888e+03, threshold=2.018e+03, percent-clipped=13.0 2023-06-27 19:39:11,186 INFO [train.py:996] (2/4) Epoch 11, batch 8650, loss[loss=0.2297, simple_loss=0.335, pruned_loss=0.06219, over 21768.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3045, pruned_loss=0.06929, over 4264177.65 frames. ], batch size: 282, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:39:41,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1881636.0, ans=0.125 2023-06-27 19:40:30,581 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.52 vs. limit=10.0 2023-06-27 19:40:33,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1881816.0, ans=0.125 2023-06-27 19:40:52,493 INFO [train.py:996] (2/4) Epoch 11, batch 8700, loss[loss=0.1831, simple_loss=0.2567, pruned_loss=0.05474, over 21371.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.296, pruned_loss=0.06564, over 4263706.36 frames. 
], batch size: 131, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:41:03,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1881876.0, ans=0.125 2023-06-27 19:41:03,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1881876.0, ans=0.1 2023-06-27 19:42:15,392 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.345e+02 6.737e+02 1.063e+03 1.710e+03 3.619e+03, threshold=2.126e+03, percent-clipped=15.0 2023-06-27 19:42:35,711 INFO [train.py:996] (2/4) Epoch 11, batch 8750, loss[loss=0.2226, simple_loss=0.2873, pruned_loss=0.07895, over 21467.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2936, pruned_loss=0.06608, over 4261702.75 frames. ], batch size: 144, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:42:48,966 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.05 vs. limit=6.0 2023-06-27 19:42:52,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1882236.0, ans=15.0 2023-06-27 19:44:18,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1882476.0, ans=0.125 2023-06-27 19:44:19,329 INFO [train.py:996] (2/4) Epoch 11, batch 8800, loss[loss=0.1973, simple_loss=0.2736, pruned_loss=0.06048, over 20799.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.3027, pruned_loss=0.06774, over 4257584.33 frames. ], batch size: 608, lr: 2.68e-03, grad_scale: 32.0 2023-06-27 19:44:21,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1882476.0, ans=0.0 2023-06-27 19:45:06,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1882596.0, ans=0.07 2023-06-27 19:45:32,076 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.11 vs. limit=15.0 2023-06-27 19:45:45,431 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.70 vs. limit=15.0 2023-06-27 19:45:49,238 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.680e+02 9.134e+02 1.413e+03 2.470e+03 4.738e+03, threshold=2.826e+03, percent-clipped=30.0 2023-06-27 19:45:51,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1882716.0, ans=0.05 2023-06-27 19:46:02,354 INFO [train.py:996] (2/4) Epoch 11, batch 8850, loss[loss=0.2267, simple_loss=0.3007, pruned_loss=0.07635, over 21557.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3082, pruned_loss=0.07045, over 4253611.02 frames. 
], batch size: 441, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:46:50,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1882896.0, ans=0.125 2023-06-27 19:47:13,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1882956.0, ans=0.0 2023-06-27 19:47:34,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1883016.0, ans=0.95 2023-06-27 19:47:50,834 INFO [train.py:996] (2/4) Epoch 11, batch 8900, loss[loss=0.1781, simple_loss=0.2517, pruned_loss=0.05223, over 21432.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.3007, pruned_loss=0.06888, over 4259035.50 frames. ], batch size: 194, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:47:58,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1883076.0, ans=0.125 2023-06-27 19:48:27,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1883136.0, ans=0.125 2023-06-27 19:48:50,057 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.02 vs. limit=22.5 2023-06-27 19:49:18,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1883316.0, ans=0.1 2023-06-27 19:49:23,140 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.611e+02 6.392e+02 1.039e+03 1.753e+03 5.076e+03, threshold=2.078e+03, percent-clipped=8.0 2023-06-27 19:49:36,286 INFO [train.py:996] (2/4) Epoch 11, batch 8950, loss[loss=0.248, simple_loss=0.3362, pruned_loss=0.07993, over 21614.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.3014, pruned_loss=0.06855, over 4261987.05 frames. ], batch size: 414, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:49:52,332 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.73 vs. limit=22.5 2023-06-27 19:50:28,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1883496.0, ans=0.125 2023-06-27 19:51:05,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1883616.0, ans=0.04949747468305833 2023-06-27 19:51:09,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1883616.0, ans=0.125 2023-06-27 19:51:09,727 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.29 vs. limit=15.0 2023-06-27 19:51:18,655 INFO [train.py:996] (2/4) Epoch 11, batch 9000, loss[loss=0.1954, simple_loss=0.2708, pruned_loss=0.05996, over 21535.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2941, pruned_loss=0.06823, over 4265240.91 frames. ], batch size: 195, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:51:18,656 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-27 19:51:37,904 INFO [train.py:1028] (2/4) Epoch 11, validation: loss=0.2621, simple_loss=0.3543, pruned_loss=0.08494, over 1796401.00 frames. 
2023-06-27 19:51:37,905 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23793MB 2023-06-27 19:51:38,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1883676.0, ans=0.125 2023-06-27 19:52:46,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1883856.0, ans=0.0 2023-06-27 19:52:53,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1883856.0, ans=0.05 2023-06-27 19:52:56,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1883856.0, ans=0.1 2023-06-27 19:52:58,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1883856.0, ans=0.125 2023-06-27 19:53:04,550 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.323e+02 6.298e+02 8.263e+02 1.367e+03 3.761e+03, threshold=1.653e+03, percent-clipped=12.0 2023-06-27 19:53:16,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1883916.0, ans=0.125 2023-06-27 19:53:28,400 INFO [train.py:996] (2/4) Epoch 11, batch 9050, loss[loss=0.1893, simple_loss=0.2738, pruned_loss=0.05235, over 21536.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2886, pruned_loss=0.06497, over 4262564.04 frames. ], batch size: 230, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:54:04,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1884036.0, ans=0.0 2023-06-27 19:54:09,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1884036.0, ans=0.0 2023-06-27 19:54:20,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=1884096.0, ans=0.1 2023-06-27 19:54:26,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1884096.0, ans=0.125 2023-06-27 19:54:35,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1884156.0, ans=0.1 2023-06-27 19:55:13,463 INFO [train.py:996] (2/4) Epoch 11, batch 9100, loss[loss=0.234, simple_loss=0.3141, pruned_loss=0.07699, over 21710.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2957, pruned_loss=0.06801, over 4255315.72 frames. ], batch size: 332, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:56:18,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1884456.0, ans=0.125 2023-06-27 19:56:44,831 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.255e+02 7.145e+02 1.042e+03 1.570e+03 3.461e+03, threshold=2.085e+03, percent-clipped=19.0 2023-06-27 19:56:48,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1884516.0, ans=0.125 2023-06-27 19:56:58,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1884516.0, ans=0.125 2023-06-27 19:57:03,245 INFO [train.py:996] (2/4) Epoch 11, batch 9150, loss[loss=0.2536, simple_loss=0.3713, pruned_loss=0.06791, over 19699.00 frames. 
], tot_loss[loss=0.2157, simple_loss=0.2991, pruned_loss=0.06619, over 4261209.43 frames. ], batch size: 702, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:57:55,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1884696.0, ans=0.125 2023-06-27 19:58:45,963 INFO [train.py:996] (2/4) Epoch 11, batch 9200, loss[loss=0.2194, simple_loss=0.3185, pruned_loss=0.06016, over 21048.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.3009, pruned_loss=0.0649, over 4265234.53 frames. ], batch size: 608, lr: 2.68e-03, grad_scale: 32.0 2023-06-27 19:59:08,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1884936.0, ans=0.2 2023-06-27 19:59:32,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1884996.0, ans=0.125 2023-06-27 19:59:37,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1884996.0, ans=0.2 2023-06-27 19:59:37,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1884996.0, ans=0.1 2023-06-27 20:00:09,236 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.39 vs. limit=15.0 2023-06-27 20:00:16,345 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.692e+02 7.193e+02 1.189e+03 2.039e+03 4.796e+03, threshold=2.378e+03, percent-clipped=22.0 2023-06-27 20:00:28,230 INFO [train.py:996] (2/4) Epoch 11, batch 9250, loss[loss=0.2032, simple_loss=0.2655, pruned_loss=0.07048, over 21218.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.304, pruned_loss=0.06729, over 4266022.77 frames. ], batch size: 176, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:01:05,920 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.50 vs. limit=10.0 2023-06-27 20:01:12,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1885296.0, ans=0.125 2023-06-27 20:01:17,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1885296.0, ans=0.1 2023-06-27 20:01:51,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1885356.0, ans=0.0 2023-06-27 20:02:17,643 INFO [train.py:996] (2/4) Epoch 11, batch 9300, loss[loss=0.2083, simple_loss=0.2845, pruned_loss=0.06612, over 21326.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2984, pruned_loss=0.0672, over 4265788.81 frames. ], batch size: 131, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:02:35,682 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.08 vs. limit=10.0 2023-06-27 20:03:30,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1885656.0, ans=0.125 2023-06-27 20:03:33,052 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.43 vs. 
limit=15.0 2023-06-27 20:03:50,077 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.175e+02 5.672e+02 8.335e+02 1.329e+03 3.533e+03, threshold=1.667e+03, percent-clipped=8.0 2023-06-27 20:03:51,243 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=22.5 2023-06-27 20:04:01,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1885776.0, ans=0.1 2023-06-27 20:04:02,355 INFO [train.py:996] (2/4) Epoch 11, batch 9350, loss[loss=0.278, simple_loss=0.3518, pruned_loss=0.1021, over 21434.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.3021, pruned_loss=0.06807, over 4262932.18 frames. ], batch size: 471, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:04:18,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1885836.0, ans=0.125 2023-06-27 20:04:46,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1885896.0, ans=0.0 2023-06-27 20:05:34,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1886016.0, ans=0.1 2023-06-27 20:05:45,799 INFO [train.py:996] (2/4) Epoch 11, batch 9400, loss[loss=0.1976, simple_loss=0.2658, pruned_loss=0.06468, over 21533.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3041, pruned_loss=0.06882, over 4268011.70 frames. ], batch size: 230, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:06:42,865 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.63 vs. limit=22.5 2023-06-27 20:07:03,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1886256.0, ans=0.0 2023-06-27 20:07:13,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1886316.0, ans=0.0 2023-06-27 20:07:16,433 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.854e+02 7.483e+02 1.060e+03 1.789e+03 3.889e+03, threshold=2.119e+03, percent-clipped=27.0 2023-06-27 20:07:27,689 INFO [train.py:996] (2/4) Epoch 11, batch 9450, loss[loss=0.1937, simple_loss=0.2689, pruned_loss=0.05918, over 21342.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2965, pruned_loss=0.06741, over 4270589.50 frames. ], batch size: 131, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:07:35,817 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=12.0 2023-06-27 20:07:50,642 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.55 vs. limit=10.0 2023-06-27 20:08:06,323 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.56 vs. 
limit=15.0 2023-06-27 20:08:15,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1886496.0, ans=0.125 2023-06-27 20:08:23,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1886496.0, ans=0.015 2023-06-27 20:09:11,693 INFO [train.py:996] (2/4) Epoch 11, batch 9500, loss[loss=0.2141, simple_loss=0.292, pruned_loss=0.06815, over 21841.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2891, pruned_loss=0.06563, over 4268441.97 frames. ], batch size: 124, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:09:24,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1886676.0, ans=0.125 2023-06-27 20:09:32,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1886736.0, ans=0.125 2023-06-27 20:09:47,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1886736.0, ans=0.125 2023-06-27 20:09:59,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1886796.0, ans=0.0 2023-06-27 20:10:12,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1886796.0, ans=0.125 2023-06-27 20:10:16,293 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 20:10:35,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1886916.0, ans=0.1 2023-06-27 20:10:36,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1886916.0, ans=0.1 2023-06-27 20:10:38,453 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.105e+02 7.389e+02 1.115e+03 1.559e+03 4.093e+03, threshold=2.229e+03, percent-clipped=13.0 2023-06-27 20:10:49,770 INFO [train.py:996] (2/4) Epoch 11, batch 9550, loss[loss=0.2417, simple_loss=0.3165, pruned_loss=0.08342, over 21381.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2932, pruned_loss=0.06695, over 4256878.55 frames. ], batch size: 131, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:10:52,712 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.73 vs. limit=10.0 2023-06-27 20:11:09,028 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.66 vs. limit=15.0 2023-06-27 20:11:26,911 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-27 20:11:28,522 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.62 vs. 
limit=15.0 2023-06-27 20:11:34,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1887096.0, ans=0.125 2023-06-27 20:11:36,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1887096.0, ans=0.0 2023-06-27 20:12:15,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1887216.0, ans=0.1 2023-06-27 20:12:26,517 INFO [train.py:996] (2/4) Epoch 11, batch 9600, loss[loss=0.2102, simple_loss=0.2922, pruned_loss=0.06411, over 21422.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2977, pruned_loss=0.06858, over 4263994.16 frames. ], batch size: 131, lr: 2.68e-03, grad_scale: 32.0 2023-06-27 20:12:30,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1887276.0, ans=0.0 2023-06-27 20:13:05,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=1887336.0, ans=0.1 2023-06-27 20:13:54,721 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.225e+02 7.650e+02 1.090e+03 1.713e+03 4.107e+03, threshold=2.181e+03, percent-clipped=11.0 2023-06-27 20:14:00,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1887516.0, ans=0.0 2023-06-27 20:14:05,165 INFO [train.py:996] (2/4) Epoch 11, batch 9650, loss[loss=0.2265, simple_loss=0.3116, pruned_loss=0.07069, over 21583.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2977, pruned_loss=0.06882, over 4272326.82 frames. ], batch size: 389, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:14:35,848 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-06-27 20:15:53,495 INFO [train.py:996] (2/4) Epoch 11, batch 9700, loss[loss=0.2488, simple_loss=0.3167, pruned_loss=0.09044, over 21637.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.3001, pruned_loss=0.06931, over 4279880.51 frames. ], batch size: 508, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:16:08,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1887876.0, ans=0.0 2023-06-27 20:16:35,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1887996.0, ans=0.1 2023-06-27 20:17:09,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1888056.0, ans=0.2 2023-06-27 20:17:20,718 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.740e+02 5.952e+02 8.386e+02 1.192e+03 2.882e+03, threshold=1.677e+03, percent-clipped=3.0 2023-06-27 20:17:35,537 INFO [train.py:996] (2/4) Epoch 11, batch 9750, loss[loss=0.2053, simple_loss=0.2738, pruned_loss=0.06839, over 21513.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2933, pruned_loss=0.06798, over 4271963.36 frames. 
], batch size: 391, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:17:42,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1888176.0, ans=0.2 2023-06-27 20:18:03,152 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 20:18:37,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1888356.0, ans=0.125 2023-06-27 20:18:46,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1888356.0, ans=0.125 2023-06-27 20:18:50,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1888356.0, ans=0.0 2023-06-27 20:19:10,810 INFO [train.py:996] (2/4) Epoch 11, batch 9800, loss[loss=0.1983, simple_loss=0.27, pruned_loss=0.06333, over 21573.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.295, pruned_loss=0.06835, over 4263051.42 frames. ], batch size: 263, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:19:48,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1888536.0, ans=0.0 2023-06-27 20:20:32,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1888656.0, ans=0.1 2023-06-27 20:20:42,519 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.393e+02 6.342e+02 8.538e+02 1.222e+03 6.218e+03, threshold=1.708e+03, percent-clipped=13.0 2023-06-27 20:20:47,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1888716.0, ans=0.0 2023-06-27 20:20:52,442 INFO [train.py:996] (2/4) Epoch 11, batch 9850, loss[loss=0.1788, simple_loss=0.2411, pruned_loss=0.05825, over 21589.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2916, pruned_loss=0.06837, over 4263882.73 frames. ], batch size: 195, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:21:37,608 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.13 vs. limit=22.5 2023-06-27 20:21:49,571 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.99 vs. limit=6.0 2023-06-27 20:22:01,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1888956.0, ans=0.07 2023-06-27 20:22:11,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1888956.0, ans=0.0 2023-06-27 20:22:17,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1889016.0, ans=0.0 2023-06-27 20:22:22,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1889016.0, ans=0.125 2023-06-27 20:22:35,620 INFO [train.py:996] (2/4) Epoch 11, batch 9900, loss[loss=0.1934, simple_loss=0.2506, pruned_loss=0.06809, over 20638.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2869, pruned_loss=0.06774, over 4258308.20 frames. 
], batch size: 607, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:22:41,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1889076.0, ans=10.0 2023-06-27 20:22:55,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1889076.0, ans=0.0 2023-06-27 20:23:52,739 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-27 20:24:07,975 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.793e+02 7.874e+02 1.115e+03 1.655e+03 5.340e+03, threshold=2.230e+03, percent-clipped=22.0 2023-06-27 20:24:18,310 INFO [train.py:996] (2/4) Epoch 11, batch 9950, loss[loss=0.2068, simple_loss=0.2849, pruned_loss=0.0643, over 21342.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2875, pruned_loss=0.06929, over 4247976.37 frames. ], batch size: 176, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:24:48,374 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 20:25:03,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1889436.0, ans=0.125 2023-06-27 20:25:03,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1889436.0, ans=0.1 2023-06-27 20:25:27,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1889556.0, ans=0.2 2023-06-27 20:25:40,239 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.95 vs. limit=15.0 2023-06-27 20:25:42,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1889556.0, ans=0.0 2023-06-27 20:25:53,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1889616.0, ans=0.1 2023-06-27 20:26:16,676 INFO [train.py:996] (2/4) Epoch 11, batch 10000, loss[loss=0.1989, simple_loss=0.2694, pruned_loss=0.0642, over 21105.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2846, pruned_loss=0.06844, over 4247255.19 frames. 
], batch size: 607, lr: 2.67e-03, grad_scale: 32.0 2023-06-27 20:26:59,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1889796.0, ans=0.2 2023-06-27 20:27:06,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1889796.0, ans=0.125 2023-06-27 20:27:08,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1889796.0, ans=0.04949747468305833 2023-06-27 20:27:11,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1889856.0, ans=0.125 2023-06-27 20:27:18,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1889856.0, ans=0.015 2023-06-27 20:27:52,691 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.633e+02 6.432e+02 1.028e+03 1.503e+03 2.874e+03, threshold=2.056e+03, percent-clipped=5.0 2023-06-27 20:27:57,229 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.90 vs. limit=6.0 2023-06-27 20:28:01,337 INFO [train.py:996] (2/4) Epoch 11, batch 10050, loss[loss=0.1873, simple_loss=0.2614, pruned_loss=0.05656, over 21606.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2855, pruned_loss=0.068, over 4258618.27 frames. ], batch size: 112, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:28:25,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1890036.0, ans=0.0 2023-06-27 20:28:43,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1890096.0, ans=0.1 2023-06-27 20:28:43,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1890096.0, ans=0.2 2023-06-27 20:29:44,417 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=15.0 2023-06-27 20:29:44,814 INFO [train.py:996] (2/4) Epoch 11, batch 10100, loss[loss=0.187, simple_loss=0.2503, pruned_loss=0.06181, over 21259.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.283, pruned_loss=0.06603, over 4264979.79 frames. ], batch size: 159, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:29:48,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1890276.0, ans=0.04949747468305833 2023-06-27 20:30:35,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1890396.0, ans=0.125 2023-06-27 20:30:55,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1890456.0, ans=0.025 2023-06-27 20:31:19,823 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.253e+02 6.464e+02 9.043e+02 1.577e+03 3.572e+03, threshold=1.809e+03, percent-clipped=15.0 2023-06-27 20:31:28,360 INFO [train.py:996] (2/4) Epoch 11, batch 10150, loss[loss=0.2062, simple_loss=0.2721, pruned_loss=0.07012, over 21807.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2874, pruned_loss=0.06757, over 4268554.89 frames. 
], batch size: 98, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:31:52,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1890636.0, ans=0.125 2023-06-27 20:33:06,435 INFO [train.py:996] (2/4) Epoch 11, batch 10200, loss[loss=0.1874, simple_loss=0.2673, pruned_loss=0.05371, over 21751.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2865, pruned_loss=0.06561, over 4271325.34 frames. ], batch size: 124, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:33:52,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1890996.0, ans=0.0 2023-06-27 20:34:28,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1891056.0, ans=0.0 2023-06-27 20:34:33,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1891116.0, ans=0.125 2023-06-27 20:34:41,108 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.060e+02 5.938e+02 9.160e+02 1.393e+03 3.097e+03, threshold=1.832e+03, percent-clipped=16.0 2023-06-27 20:34:49,801 INFO [train.py:996] (2/4) Epoch 11, batch 10250, loss[loss=0.2214, simple_loss=0.3025, pruned_loss=0.07015, over 21516.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.2815, pruned_loss=0.06159, over 4258480.23 frames. ], batch size: 509, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:35:14,384 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.98 vs. limit=22.5 2023-06-27 20:35:28,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1891296.0, ans=10.0 2023-06-27 20:36:10,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1891356.0, ans=0.04949747468305833 2023-06-27 20:36:38,667 INFO [train.py:996] (2/4) Epoch 11, batch 10300, loss[loss=0.2394, simple_loss=0.343, pruned_loss=0.06788, over 21637.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2857, pruned_loss=0.06294, over 4267541.44 frames. 
], batch size: 414, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:36:44,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1891476.0, ans=0.125 2023-06-27 20:36:54,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1891536.0, ans=0.125 2023-06-27 20:37:09,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1891536.0, ans=0.125 2023-06-27 20:37:36,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1891596.0, ans=0.0 2023-06-27 20:37:49,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1891656.0, ans=0.2 2023-06-27 20:38:14,285 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.623e+02 8.106e+02 1.179e+03 1.696e+03 3.317e+03, threshold=2.359e+03, percent-clipped=22.0 2023-06-27 20:38:18,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1891716.0, ans=0.1 2023-06-27 20:38:22,826 INFO [train.py:996] (2/4) Epoch 11, batch 10350, loss[loss=0.2312, simple_loss=0.3129, pruned_loss=0.07477, over 21714.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2875, pruned_loss=0.06338, over 4260577.76 frames. ], batch size: 415, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:38:31,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1891776.0, ans=0.2 2023-06-27 20:38:56,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1891836.0, ans=0.125 2023-06-27 20:39:11,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1891896.0, ans=0.125 2023-06-27 20:39:41,955 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.45 vs. limit=15.0 2023-06-27 20:39:49,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1892016.0, ans=0.0 2023-06-27 20:40:03,129 INFO [train.py:996] (2/4) Epoch 11, batch 10400, loss[loss=0.2113, simple_loss=0.2932, pruned_loss=0.06473, over 21740.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2826, pruned_loss=0.06288, over 4257743.16 frames. ], batch size: 391, lr: 2.67e-03, grad_scale: 32.0 2023-06-27 20:40:21,418 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-06-27 20:41:36,684 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.431e+02 7.002e+02 1.054e+03 1.542e+03 5.604e+03, threshold=2.109e+03, percent-clipped=11.0 2023-06-27 20:41:42,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1892376.0, ans=0.125 2023-06-27 20:41:43,658 INFO [train.py:996] (2/4) Epoch 11, batch 10450, loss[loss=0.2322, simple_loss=0.3141, pruned_loss=0.07514, over 21630.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2869, pruned_loss=0.06545, over 4263499.86 frames. 
], batch size: 263, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:42:54,030 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 20:43:22,331 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=22.5 2023-06-27 20:43:35,367 INFO [train.py:996] (2/4) Epoch 11, batch 10500, loss[loss=0.1968, simple_loss=0.3001, pruned_loss=0.04677, over 20814.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2863, pruned_loss=0.0643, over 4264266.37 frames. ], batch size: 608, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 20:43:42,730 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-06-27 20:44:05,882 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.63 vs. limit=15.0 2023-06-27 20:44:11,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1892796.0, ans=0.125 2023-06-27 20:44:39,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1892856.0, ans=0.0 2023-06-27 20:44:40,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1892856.0, ans=0.125 2023-06-27 20:44:44,841 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=22.5 2023-06-27 20:44:55,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1892916.0, ans=0.0 2023-06-27 20:45:06,711 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.280e+02 6.335e+02 9.197e+02 1.411e+03 2.954e+03, threshold=1.839e+03, percent-clipped=7.0 2023-06-27 20:45:11,721 INFO [train.py:996] (2/4) Epoch 11, batch 10550, loss[loss=0.1877, simple_loss=0.257, pruned_loss=0.05925, over 21662.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.281, pruned_loss=0.06409, over 4266442.52 frames. ], batch size: 333, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 20:45:40,940 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-06-27 20:45:43,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1893036.0, ans=0.07 2023-06-27 20:46:05,824 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.85 vs. limit=22.5 2023-06-27 20:46:46,772 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.46 vs. limit=15.0 2023-06-27 20:47:00,280 INFO [train.py:996] (2/4) Epoch 11, batch 10600, loss[loss=0.1953, simple_loss=0.2939, pruned_loss=0.04835, over 21623.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2766, pruned_loss=0.06257, over 4265558.93 frames. ], batch size: 389, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 20:47:18,608 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.05 vs. 
limit=12.0 2023-06-27 20:47:25,290 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 20:48:01,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=1893456.0, ans=0.2 2023-06-27 20:48:05,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1893456.0, ans=15.0 2023-06-27 20:48:21,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1893456.0, ans=0.125 2023-06-27 20:48:39,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1893516.0, ans=0.125 2023-06-27 20:48:45,441 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.137e+02 6.706e+02 1.104e+03 1.395e+03 2.716e+03, threshold=2.208e+03, percent-clipped=10.0 2023-06-27 20:48:50,912 INFO [train.py:996] (2/4) Epoch 11, batch 10650, loss[loss=0.1957, simple_loss=0.2874, pruned_loss=0.05196, over 21666.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.28, pruned_loss=0.06138, over 4262307.02 frames. ], batch size: 389, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 20:49:25,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1893636.0, ans=0.0 2023-06-27 20:49:27,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1893636.0, ans=0.1 2023-06-27 20:49:38,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1893696.0, ans=0.125 2023-06-27 20:50:28,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1893816.0, ans=0.0 2023-06-27 20:50:31,105 INFO [train.py:996] (2/4) Epoch 11, batch 10700, loss[loss=0.2386, simple_loss=0.3252, pruned_loss=0.07605, over 21423.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.281, pruned_loss=0.06232, over 4263845.82 frames. ], batch size: 131, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 20:52:10,064 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.693e+02 7.296e+02 1.063e+03 2.005e+03 4.294e+03, threshold=2.126e+03, percent-clipped=18.0 2023-06-27 20:52:14,956 INFO [train.py:996] (2/4) Epoch 11, batch 10750, loss[loss=0.22, simple_loss=0.3083, pruned_loss=0.06587, over 21808.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2913, pruned_loss=0.0663, over 4267996.54 frames. 
], batch size: 124, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 20:52:43,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1894236.0, ans=0.125 2023-06-27 20:52:44,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1894236.0, ans=10.0 2023-06-27 20:52:49,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1894236.0, ans=10.0 2023-06-27 20:53:39,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1894356.0, ans=0.125 2023-06-27 20:53:48,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1894416.0, ans=0.0 2023-06-27 20:53:57,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1894416.0, ans=0.2 2023-06-27 20:54:00,258 INFO [train.py:996] (2/4) Epoch 11, batch 10800, loss[loss=0.2412, simple_loss=0.3171, pruned_loss=0.0826, over 20642.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2941, pruned_loss=0.0661, over 4271576.58 frames. ], batch size: 607, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:54:56,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1894596.0, ans=0.125 2023-06-27 20:54:59,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1894596.0, ans=0.125 2023-06-27 20:55:02,301 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 20:55:32,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1894716.0, ans=0.2 2023-06-27 20:55:38,141 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.255e+02 6.646e+02 1.015e+03 1.682e+03 4.029e+03, threshold=2.031e+03, percent-clipped=15.0 2023-06-27 20:55:40,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1894716.0, ans=0.125 2023-06-27 20:55:43,152 INFO [train.py:996] (2/4) Epoch 11, batch 10850, loss[loss=0.2031, simple_loss=0.2875, pruned_loss=0.05929, over 21210.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2947, pruned_loss=0.06614, over 4276694.95 frames. ], batch size: 176, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:55:45,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1894776.0, ans=0.125 2023-06-27 20:56:08,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1894836.0, ans=0.0 2023-06-27 20:56:19,243 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.90 vs. 
limit=22.5 2023-06-27 20:56:37,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1894896.0, ans=0.2 2023-06-27 20:57:11,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1895016.0, ans=0.1 2023-06-27 20:57:11,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1895016.0, ans=0.0 2023-06-27 20:57:13,849 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0 2023-06-27 20:57:25,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1895016.0, ans=0.1 2023-06-27 20:57:27,846 INFO [train.py:996] (2/4) Epoch 11, batch 10900, loss[loss=0.2139, simple_loss=0.3174, pruned_loss=0.05518, over 19997.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2879, pruned_loss=0.06477, over 4275755.05 frames. ], batch size: 702, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:57:38,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1895076.0, ans=0.1 2023-06-27 20:57:38,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1895076.0, ans=0.025 2023-06-27 20:57:41,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1895076.0, ans=0.125 2023-06-27 20:58:42,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1895256.0, ans=0.125 2023-06-27 20:58:47,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1895316.0, ans=0.1 2023-06-27 20:58:59,320 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.887e+02 5.631e+02 8.269e+02 1.201e+03 2.087e+03, threshold=1.654e+03, percent-clipped=2.0 2023-06-27 20:59:04,295 INFO [train.py:996] (2/4) Epoch 11, batch 10950, loss[loss=0.1829, simple_loss=0.2503, pruned_loss=0.05774, over 21825.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2836, pruned_loss=0.06243, over 4265794.14 frames. ], batch size: 98, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:00:04,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1895496.0, ans=0.1 2023-06-27 21:00:06,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1895496.0, ans=0.125 2023-06-27 21:00:21,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1895556.0, ans=0.2 2023-06-27 21:00:42,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1895616.0, ans=0.125 2023-06-27 21:00:50,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1895676.0, ans=0.1 2023-06-27 21:00:51,816 INFO [train.py:996] (2/4) Epoch 11, batch 11000, loss[loss=0.2362, simple_loss=0.2975, pruned_loss=0.08743, over 21570.00 frames. 
], tot_loss[loss=0.2042, simple_loss=0.2828, pruned_loss=0.06285, over 4260972.29 frames. ], batch size: 471, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:01:17,594 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.82 vs. limit=15.0 2023-06-27 21:01:19,453 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=22.5 2023-06-27 21:02:14,524 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.12 vs. limit=10.0 2023-06-27 21:02:24,707 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.482e+02 6.416e+02 9.382e+02 1.390e+03 3.598e+03, threshold=1.876e+03, percent-clipped=17.0 2023-06-27 21:02:28,489 INFO [train.py:996] (2/4) Epoch 11, batch 11050, loss[loss=0.1899, simple_loss=0.2617, pruned_loss=0.05902, over 22007.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2805, pruned_loss=0.06401, over 4263520.44 frames. ], batch size: 103, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 21:03:28,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1896096.0, ans=0.125 2023-06-27 21:04:02,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1896216.0, ans=0.0 2023-06-27 21:04:16,373 INFO [train.py:996] (2/4) Epoch 11, batch 11100, loss[loss=0.2067, simple_loss=0.2685, pruned_loss=0.07243, over 21776.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2802, pruned_loss=0.065, over 4259649.90 frames. ], batch size: 112, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 21:04:16,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1896276.0, ans=0.125 2023-06-27 21:05:08,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1896396.0, ans=0.1 2023-06-27 21:05:55,702 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.559e+02 5.976e+02 8.582e+02 1.481e+03 2.937e+03, threshold=1.716e+03, percent-clipped=16.0 2023-06-27 21:05:56,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1896516.0, ans=0.1 2023-06-27 21:05:58,959 INFO [train.py:996] (2/4) Epoch 11, batch 11150, loss[loss=0.2075, simple_loss=0.3093, pruned_loss=0.0528, over 21597.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2785, pruned_loss=0.06451, over 4265806.48 frames. 
], batch size: 230, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 21:06:01,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1896576.0, ans=0.0 2023-06-27 21:06:07,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1896576.0, ans=0.0 2023-06-27 21:06:44,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1896636.0, ans=15.0 2023-06-27 21:06:45,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1896696.0, ans=0.05 2023-06-27 21:07:01,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1896696.0, ans=0.1 2023-06-27 21:07:13,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1896756.0, ans=0.125 2023-06-27 21:07:42,372 INFO [train.py:996] (2/4) Epoch 11, batch 11200, loss[loss=0.2127, simple_loss=0.2794, pruned_loss=0.07303, over 21824.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2778, pruned_loss=0.06383, over 4258433.91 frames. ], batch size: 102, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:07:43,564 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.20 vs. limit=10.0 2023-06-27 21:08:09,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1896936.0, ans=0.125 2023-06-27 21:08:59,164 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 21:08:59,792 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.29 vs. limit=15.0 2023-06-27 21:09:20,938 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.297e+02 6.349e+02 8.470e+02 1.226e+03 2.540e+03, threshold=1.694e+03, percent-clipped=7.0 2023-06-27 21:09:24,606 INFO [train.py:996] (2/4) Epoch 11, batch 11250, loss[loss=0.2109, simple_loss=0.286, pruned_loss=0.06796, over 21466.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2778, pruned_loss=0.06409, over 4256482.40 frames. ], batch size: 194, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:09:44,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1897236.0, ans=0.2 2023-06-27 21:10:07,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1897236.0, ans=0.2 2023-06-27 21:10:24,053 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 21:11:06,526 INFO [train.py:996] (2/4) Epoch 11, batch 11300, loss[loss=0.2046, simple_loss=0.2797, pruned_loss=0.06473, over 21314.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2784, pruned_loss=0.06402, over 4268837.40 frames. 
], batch size: 159, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:11:13,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1897476.0, ans=0.05 2023-06-27 21:11:22,707 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.70 vs. limit=15.0 2023-06-27 21:11:25,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1897476.0, ans=0.125 2023-06-27 21:11:43,097 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 21:11:54,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1897596.0, ans=0.1 2023-06-27 21:12:04,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1897596.0, ans=0.5 2023-06-27 21:12:29,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1897716.0, ans=0.0 2023-06-27 21:12:42,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1897716.0, ans=0.125 2023-06-27 21:12:44,920 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.908e+02 6.761e+02 9.436e+02 1.469e+03 2.612e+03, threshold=1.887e+03, percent-clipped=16.0 2023-06-27 21:12:48,356 INFO [train.py:996] (2/4) Epoch 11, batch 11350, loss[loss=0.1846, simple_loss=0.2671, pruned_loss=0.051, over 21618.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2793, pruned_loss=0.06324, over 4267025.62 frames. ], batch size: 263, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:13:08,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1897776.0, ans=0.1 2023-06-27 21:13:09,208 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-06-27 21:13:25,724 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.32 vs. limit=8.0 2023-06-27 21:13:33,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1897836.0, ans=0.1 2023-06-27 21:13:36,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1897896.0, ans=0.125 2023-06-27 21:13:51,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1897956.0, ans=0.0 2023-06-27 21:14:30,957 INFO [train.py:996] (2/4) Epoch 11, batch 11400, loss[loss=0.218, simple_loss=0.3043, pruned_loss=0.06585, over 21708.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2856, pruned_loss=0.06585, over 4270335.06 frames. 
], batch size: 298, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:15:27,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1898196.0, ans=0.125 2023-06-27 21:15:35,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1898256.0, ans=0.125 2023-06-27 21:16:05,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1898316.0, ans=0.125 2023-06-27 21:16:09,915 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.866e+02 7.031e+02 1.006e+03 1.495e+03 2.656e+03, threshold=2.011e+03, percent-clipped=10.0 2023-06-27 21:16:23,427 INFO [train.py:996] (2/4) Epoch 11, batch 11450, loss[loss=0.2286, simple_loss=0.3143, pruned_loss=0.07147, over 21587.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2871, pruned_loss=0.06554, over 4271165.34 frames. ], batch size: 414, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:17:01,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1898496.0, ans=0.1 2023-06-27 21:17:14,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1898496.0, ans=0.125 2023-06-27 21:17:18,541 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=15.0 2023-06-27 21:18:05,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1898676.0, ans=0.125 2023-06-27 21:18:06,607 INFO [train.py:996] (2/4) Epoch 11, batch 11500, loss[loss=0.2088, simple_loss=0.3076, pruned_loss=0.05503, over 21857.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2892, pruned_loss=0.06657, over 4268358.66 frames. ], batch size: 371, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:19:18,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1898856.0, ans=0.2 2023-06-27 21:19:48,636 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.762e+02 6.937e+02 1.166e+03 1.634e+03 3.269e+03, threshold=2.333e+03, percent-clipped=13.0 2023-06-27 21:19:52,366 INFO [train.py:996] (2/4) Epoch 11, batch 11550, loss[loss=0.1737, simple_loss=0.2311, pruned_loss=0.0581, over 20739.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2934, pruned_loss=0.06652, over 4261951.60 frames. 
], batch size: 608, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:20:02,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1898976.0, ans=0.0 2023-06-27 21:20:09,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1899036.0, ans=0.125 2023-06-27 21:20:10,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1899036.0, ans=0.07 2023-06-27 21:21:20,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1899216.0, ans=0.1 2023-06-27 21:21:25,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1899216.0, ans=0.125 2023-06-27 21:21:38,051 INFO [train.py:996] (2/4) Epoch 11, batch 11600, loss[loss=0.2312, simple_loss=0.3264, pruned_loss=0.06803, over 21422.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3087, pruned_loss=0.06852, over 4255797.86 frames. ], batch size: 131, lr: 2.67e-03, grad_scale: 32.0 2023-06-27 21:21:50,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1899276.0, ans=0.125 2023-06-27 21:21:59,550 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.86 vs. limit=15.0 2023-06-27 21:22:43,155 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0 2023-06-27 21:23:15,068 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.239e+02 7.630e+02 1.389e+03 2.274e+03 4.713e+03, threshold=2.778e+03, percent-clipped=21.0 2023-06-27 21:23:16,791 INFO [train.py:996] (2/4) Epoch 11, batch 11650, loss[loss=0.2184, simple_loss=0.3019, pruned_loss=0.06749, over 21265.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3155, pruned_loss=0.06959, over 4257642.68 frames. ], batch size: 549, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:23:31,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1899576.0, ans=0.125 2023-06-27 21:23:31,688 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-06-27 21:23:45,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1899636.0, ans=0.0 2023-06-27 21:24:28,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1899756.0, ans=0.1 2023-06-27 21:24:38,694 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.36 vs. limit=15.0 2023-06-27 21:24:41,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1899816.0, ans=0.5 2023-06-27 21:24:53,628 INFO [train.py:996] (2/4) Epoch 11, batch 11700, loss[loss=0.1933, simple_loss=0.259, pruned_loss=0.06384, over 21842.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.3058, pruned_loss=0.0688, over 4252917.43 frames. 
], batch size: 373, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:24:54,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1899876.0, ans=0.125 2023-06-27 21:25:22,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1899936.0, ans=0.0 2023-06-27 21:26:04,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1900056.0, ans=15.0 2023-06-27 21:26:28,324 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.521e+02 7.314e+02 1.090e+03 1.615e+03 2.478e+03, threshold=2.180e+03, percent-clipped=0.0 2023-06-27 21:26:29,961 INFO [train.py:996] (2/4) Epoch 11, batch 11750, loss[loss=0.2076, simple_loss=0.2746, pruned_loss=0.07032, over 21827.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2968, pruned_loss=0.06817, over 4249423.35 frames. ], batch size: 98, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:26:42,857 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.49 vs. limit=6.0 2023-06-27 21:27:00,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1900236.0, ans=0.035 2023-06-27 21:27:35,011 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-06-27 21:27:38,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1900356.0, ans=0.2 2023-06-27 21:27:51,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1900416.0, ans=0.0 2023-06-27 21:28:08,240 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=15.0 2023-06-27 21:28:08,599 INFO [train.py:996] (2/4) Epoch 11, batch 11800, loss[loss=0.2212, simple_loss=0.303, pruned_loss=0.06969, over 21701.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2986, pruned_loss=0.07008, over 4255228.88 frames. ], batch size: 298, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:28:22,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1900476.0, ans=0.125 2023-06-27 21:28:35,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1900536.0, ans=0.125 2023-06-27 21:28:42,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1900536.0, ans=0.125 2023-06-27 21:29:43,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1900716.0, ans=0.2 2023-06-27 21:29:44,973 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.871e+02 7.084e+02 9.791e+02 1.465e+03 2.454e+03, threshold=1.958e+03, percent-clipped=4.0 2023-06-27 21:29:46,637 INFO [train.py:996] (2/4) Epoch 11, batch 11850, loss[loss=0.2115, simple_loss=0.3138, pruned_loss=0.05461, over 21700.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2988, pruned_loss=0.06877, over 4262845.58 frames. 
], batch size: 389, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:29:56,674 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.27 vs. limit=15.0 2023-06-27 21:30:37,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1900896.0, ans=0.1 2023-06-27 21:30:56,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1900956.0, ans=0.125 2023-06-27 21:31:10,305 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.19 vs. limit=10.0 2023-06-27 21:31:25,868 INFO [train.py:996] (2/4) Epoch 11, batch 11900, loss[loss=0.197, simple_loss=0.2803, pruned_loss=0.05691, over 21662.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.3, pruned_loss=0.06693, over 4267317.24 frames. ], batch size: 298, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:32:03,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1901136.0, ans=0.0 2023-06-27 21:32:44,481 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=22.5 2023-06-27 21:33:04,911 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.61 vs. limit=15.0 2023-06-27 21:33:08,455 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.313e+02 5.634e+02 7.548e+02 1.171e+03 3.128e+03, threshold=1.510e+03, percent-clipped=7.0 2023-06-27 21:33:14,928 INFO [train.py:996] (2/4) Epoch 11, batch 11950, loss[loss=0.2413, simple_loss=0.3424, pruned_loss=0.07012, over 21598.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2997, pruned_loss=0.06389, over 4258447.62 frames. ], batch size: 441, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:33:49,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1901436.0, ans=0.0 2023-06-27 21:34:22,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1901556.0, ans=0.0 2023-06-27 21:34:35,445 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.78 vs. limit=15.0 2023-06-27 21:34:52,072 INFO [train.py:996] (2/4) Epoch 11, batch 12000, loss[loss=0.1949, simple_loss=0.2663, pruned_loss=0.06181, over 21800.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.294, pruned_loss=0.06215, over 4264063.62 frames. ], batch size: 352, lr: 2.67e-03, grad_scale: 32.0 2023-06-27 21:34:52,073 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-27 21:35:07,526 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.2460, 3.0145, 3.2305, 3.3859, 2.8689, 2.7848, 3.4286, 3.3494], device='cuda:2') 2023-06-27 21:35:12,140 INFO [train.py:1028] (2/4) Epoch 11, validation: loss=0.2616, simple_loss=0.3513, pruned_loss=0.08594, over 1796401.00 frames. 
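The recurring [optim.py:471] entries in this stretch of the log summarize recently observed gradient norms as five quantiles (min, 25%, median, 75%, max) together with a clipping threshold; in these entries the reported threshold consistently equals Clipping_scale (2.0) times the median quantile, and percent-clipped reports how often that threshold was exceeded. The sketch below is illustrative only: it assumes a window of recent per-update gradient norms is available, and the function and variable names are hypothetical rather than taken from icefall's optim.py.

import torch

def grad_norm_report(recent_norms, clipping_scale=2.0):
    # Illustrative sketch (an assumption, not icefall's implementation):
    # summarize a window of recent gradient norms in the style of the
    # "[optim.py:471] Clipping_scale=..., grad-norm quartiles ..." log lines.
    norms = torch.tensor(recent_norms, dtype=torch.float32)
    q = torch.quantile(norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * q[2].item()  # threshold tracks scale x median
    percent_clipped = 100.0 * (norms > threshold).float().mean().item()
    quartiles = " ".join(f"{v:.3e}" for v in q.tolist())
    return (f"Clipping_scale={clipping_scale}, grad-norm quartiles {quartiles}, "
            f"threshold={threshold:.3e}, percent-clipped={percent_clipped:.1f}")

For example, grad_norm_report([490.8, 676.1, 943.6, 1469.0, 2612.0]) reproduces the quartile and threshold fields of the first report above (threshold = 2.0 x 943.6 = 1.887e+03); the log's percent-clipped figures are presumably accumulated over more updates than the five norms used here.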
2023-06-27 21:35:12,140 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23793MB 2023-06-27 21:35:47,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1901736.0, ans=0.2 2023-06-27 21:36:00,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1901796.0, ans=0.04949747468305833 2023-06-27 21:36:11,519 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=15.0 2023-06-27 21:36:59,870 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.508e+02 6.629e+02 1.038e+03 1.682e+03 4.454e+03, threshold=2.077e+03, percent-clipped=31.0 2023-06-27 21:36:59,901 INFO [train.py:996] (2/4) Epoch 11, batch 12050, loss[loss=0.2207, simple_loss=0.2925, pruned_loss=0.07446, over 21812.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2921, pruned_loss=0.06379, over 4266944.84 frames. ], batch size: 391, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:37:22,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1902036.0, ans=0.125 2023-06-27 21:37:40,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1902096.0, ans=0.125 2023-06-27 21:38:43,461 INFO [train.py:996] (2/4) Epoch 11, batch 12100, loss[loss=0.2623, simple_loss=0.3255, pruned_loss=0.09954, over 21820.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2964, pruned_loss=0.06799, over 4272316.23 frames. ], batch size: 441, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:39:08,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1902336.0, ans=0.0 2023-06-27 21:39:24,824 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=15.0 2023-06-27 21:39:25,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1902396.0, ans=0.0 2023-06-27 21:39:33,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1902396.0, ans=0.0 2023-06-27 21:40:29,114 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.049e+02 8.933e+02 1.358e+03 2.118e+03 4.417e+03, threshold=2.716e+03, percent-clipped=26.0 2023-06-27 21:40:29,146 INFO [train.py:996] (2/4) Epoch 11, batch 12150, loss[loss=0.2476, simple_loss=0.3656, pruned_loss=0.06482, over 19767.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.3002, pruned_loss=0.06816, over 4269933.41 frames. ], batch size: 702, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:40:31,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1902576.0, ans=0.125 2023-06-27 21:40:38,765 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=22.5 2023-06-27 21:40:52,107 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.82 vs. 
limit=12.0 2023-06-27 21:41:03,961 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0 2023-06-27 21:41:23,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1902696.0, ans=0.0 2023-06-27 21:41:28,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1902756.0, ans=0.125 2023-06-27 21:41:33,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1902756.0, ans=0.1 2023-06-27 21:41:59,433 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=15.0 2023-06-27 21:42:04,483 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.01 vs. limit=6.0 2023-06-27 21:42:09,519 INFO [train.py:996] (2/4) Epoch 11, batch 12200, loss[loss=0.1802, simple_loss=0.2536, pruned_loss=0.05342, over 21803.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2976, pruned_loss=0.06695, over 4268620.39 frames. ], batch size: 352, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:42:48,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1902996.0, ans=0.0 2023-06-27 21:43:02,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1902996.0, ans=0.125 2023-06-27 21:43:30,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1903056.0, ans=0.1 2023-06-27 21:43:38,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1903116.0, ans=0.125 2023-06-27 21:43:50,746 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.836e+02 6.388e+02 1.060e+03 1.811e+03 4.082e+03, threshold=2.119e+03, percent-clipped=7.0 2023-06-27 21:43:50,793 INFO [train.py:996] (2/4) Epoch 11, batch 12250, loss[loss=0.1629, simple_loss=0.2539, pruned_loss=0.03601, over 21701.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2894, pruned_loss=0.06406, over 4268123.49 frames. ], batch size: 332, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:45:05,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1903356.0, ans=0.125 2023-06-27 21:45:31,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1903416.0, ans=0.125 2023-06-27 21:45:34,141 INFO [train.py:996] (2/4) Epoch 11, batch 12300, loss[loss=0.2267, simple_loss=0.3209, pruned_loss=0.06624, over 21672.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2831, pruned_loss=0.05977, over 4270392.10 frames. 
], batch size: 414, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:45:34,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1903476.0, ans=0.125 2023-06-27 21:45:34,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1903476.0, ans=0.125 2023-06-27 21:46:03,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1903536.0, ans=0.0 2023-06-27 21:46:18,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1903596.0, ans=0.125 2023-06-27 21:46:20,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1903596.0, ans=0.2 2023-06-27 21:46:56,460 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-27 21:46:58,370 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.50 vs. limit=15.0 2023-06-27 21:47:16,524 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.514e+02 6.456e+02 1.093e+03 1.764e+03 5.046e+03, threshold=2.186e+03, percent-clipped=16.0 2023-06-27 21:47:16,570 INFO [train.py:996] (2/4) Epoch 11, batch 12350, loss[loss=0.2209, simple_loss=0.2985, pruned_loss=0.07162, over 21919.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2866, pruned_loss=0.06013, over 4271049.70 frames. ], batch size: 316, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:47:25,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1903776.0, ans=0.2 2023-06-27 21:47:39,200 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.41 vs. limit=15.0 2023-06-27 21:48:49,614 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 21:48:56,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1904076.0, ans=0.125 2023-06-27 21:48:57,253 INFO [train.py:996] (2/4) Epoch 11, batch 12400, loss[loss=0.2229, simple_loss=0.29, pruned_loss=0.07786, over 21411.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.288, pruned_loss=0.0632, over 4277325.32 frames. ], batch size: 159, lr: 2.66e-03, grad_scale: 32.0 2023-06-27 21:49:17,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1904136.0, ans=0.125 2023-06-27 21:49:26,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1904136.0, ans=0.0 2023-06-27 21:49:29,898 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=12.0 2023-06-27 21:50:22,099 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-06-27 21:50:25,542 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.83 vs. 
limit=22.5 2023-06-27 21:50:39,832 INFO [train.py:996] (2/4) Epoch 11, batch 12450, loss[loss=0.2387, simple_loss=0.316, pruned_loss=0.08067, over 21608.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2923, pruned_loss=0.06624, over 4281798.26 frames. ], batch size: 263, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:50:41,638 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.549e+02 6.531e+02 8.502e+02 1.313e+03 3.916e+03, threshold=1.700e+03, percent-clipped=4.0 2023-06-27 21:50:45,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1904376.0, ans=0.0 2023-06-27 21:51:04,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1904436.0, ans=0.125 2023-06-27 21:51:04,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1904436.0, ans=0.125 2023-06-27 21:51:20,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1904496.0, ans=0.0 2023-06-27 21:51:53,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1904556.0, ans=0.125 2023-06-27 21:52:00,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1904556.0, ans=0.1 2023-06-27 21:52:29,358 INFO [train.py:996] (2/4) Epoch 11, batch 12500, loss[loss=0.2423, simple_loss=0.3456, pruned_loss=0.06947, over 21594.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3039, pruned_loss=0.06877, over 4281186.16 frames. ], batch size: 389, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:52:41,042 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.96 vs. limit=22.5 2023-06-27 21:53:38,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1904856.0, ans=0.125 2023-06-27 21:53:51,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1904916.0, ans=0.0 2023-06-27 21:54:02,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1904916.0, ans=0.125 2023-06-27 21:54:05,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1904916.0, ans=0.125 2023-06-27 21:54:09,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1904976.0, ans=0.0 2023-06-27 21:54:10,060 INFO [train.py:996] (2/4) Epoch 11, batch 12550, loss[loss=0.2054, simple_loss=0.299, pruned_loss=0.05587, over 21731.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3063, pruned_loss=0.07036, over 4275235.38 frames. 
], batch size: 298, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:54:11,832 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.997e+02 7.151e+02 9.738e+02 1.410e+03 2.995e+03, threshold=1.948e+03, percent-clipped=12.0 2023-06-27 21:55:06,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1905096.0, ans=0.0 2023-06-27 21:55:24,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1905156.0, ans=0.0 2023-06-27 21:55:43,404 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.15 vs. limit=10.0 2023-06-27 21:55:53,702 INFO [train.py:996] (2/4) Epoch 11, batch 12600, loss[loss=0.1964, simple_loss=0.2896, pruned_loss=0.05157, over 21784.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3062, pruned_loss=0.06874, over 4276148.34 frames. ], batch size: 352, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:56:05,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1905276.0, ans=0.125 2023-06-27 21:56:20,680 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 21:56:30,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1905396.0, ans=0.125 2023-06-27 21:57:00,881 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-27 21:57:09,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1905516.0, ans=0.1 2023-06-27 21:57:30,672 INFO [train.py:996] (2/4) Epoch 11, batch 12650, loss[loss=0.2065, simple_loss=0.2983, pruned_loss=0.05733, over 19880.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.298, pruned_loss=0.06513, over 4272537.04 frames. ], batch size: 702, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:57:36,982 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.857e+02 6.084e+02 8.925e+02 1.601e+03 4.127e+03, threshold=1.785e+03, percent-clipped=11.0 2023-06-27 21:58:01,865 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 21:58:07,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1905636.0, ans=0.125 2023-06-27 21:58:23,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1905696.0, ans=0.2 2023-06-27 21:58:35,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1905756.0, ans=0.1 2023-06-27 21:59:14,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1905816.0, ans=0.0 2023-06-27 21:59:17,353 INFO [train.py:996] (2/4) Epoch 11, batch 12700, loss[loss=0.2245, simple_loss=0.3016, pruned_loss=0.07365, over 21261.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.297, pruned_loss=0.06739, over 4281514.01 frames. 
], batch size: 176, lr: 2.66e-03, grad_scale: 8.0 2023-06-27 21:59:40,421 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=22.5 2023-06-27 22:00:07,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1905996.0, ans=0.125 2023-06-27 22:00:13,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1906056.0, ans=0.125 2023-06-27 22:00:21,759 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.32 vs. limit=15.0 2023-06-27 22:00:35,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1906116.0, ans=0.2 2023-06-27 22:00:59,930 INFO [train.py:996] (2/4) Epoch 11, batch 12750, loss[loss=0.199, simple_loss=0.2841, pruned_loss=0.05693, over 21753.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2978, pruned_loss=0.06789, over 4281954.91 frames. ], batch size: 282, lr: 2.66e-03, grad_scale: 8.0 2023-06-27 22:01:03,058 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.884e+02 6.384e+02 9.703e+02 1.626e+03 3.460e+03, threshold=1.941e+03, percent-clipped=17.0 2023-06-27 22:01:03,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1906176.0, ans=0.125 2023-06-27 22:01:03,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1906176.0, ans=0.125 2023-06-27 22:01:04,372 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0 2023-06-27 22:02:37,436 INFO [train.py:996] (2/4) Epoch 11, batch 12800, loss[loss=0.2319, simple_loss=0.3222, pruned_loss=0.07075, over 20756.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2981, pruned_loss=0.06856, over 4279288.74 frames. ], batch size: 607, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:03:03,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1906536.0, ans=0.0 2023-06-27 22:04:13,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1906716.0, ans=0.125 2023-06-27 22:04:16,513 INFO [train.py:996] (2/4) Epoch 11, batch 12850, loss[loss=0.1953, simple_loss=0.2835, pruned_loss=0.05351, over 21611.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.3003, pruned_loss=0.0702, over 4277273.76 frames. ], batch size: 230, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:04:19,914 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.496e+02 6.619e+02 8.381e+02 1.196e+03 2.769e+03, threshold=1.676e+03, percent-clipped=10.0 2023-06-27 22:04:54,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1906896.0, ans=0.035 2023-06-27 22:05:09,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1906896.0, ans=0.125 2023-06-27 22:05:31,900 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.15 vs. 
limit=15.0 2023-06-27 22:05:52,199 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.09 vs. limit=15.0 2023-06-27 22:05:56,186 INFO [train.py:996] (2/4) Epoch 11, batch 12900, loss[loss=0.2108, simple_loss=0.3022, pruned_loss=0.05966, over 21714.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2991, pruned_loss=0.06664, over 4280007.40 frames. ], batch size: 391, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:06:09,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1907076.0, ans=0.125 2023-06-27 22:07:00,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1907196.0, ans=0.125 2023-06-27 22:07:38,239 INFO [train.py:996] (2/4) Epoch 11, batch 12950, loss[loss=0.241, simple_loss=0.3189, pruned_loss=0.08153, over 21460.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2962, pruned_loss=0.06502, over 4277876.84 frames. ], batch size: 471, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:07:46,036 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.286e+02 5.649e+02 7.456e+02 9.840e+02 3.735e+03, threshold=1.491e+03, percent-clipped=7.0 2023-06-27 22:07:51,092 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-27 22:07:55,964 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.98 vs. limit=15.0 2023-06-27 22:09:04,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1907616.0, ans=0.125 2023-06-27 22:09:19,448 INFO [train.py:996] (2/4) Epoch 11, batch 13000, loss[loss=0.1828, simple_loss=0.2715, pruned_loss=0.0471, over 21795.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2955, pruned_loss=0.06553, over 4278216.13 frames. ], batch size: 372, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:09:31,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1907676.0, ans=0.125 2023-06-27 22:10:33,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1907856.0, ans=0.0 2023-06-27 22:10:37,448 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.35 vs. limit=15.0 2023-06-27 22:11:06,076 INFO [train.py:996] (2/4) Epoch 11, batch 13050, loss[loss=0.2188, simple_loss=0.2816, pruned_loss=0.07799, over 21596.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2906, pruned_loss=0.06356, over 4271952.67 frames. ], batch size: 548, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:11:09,269 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.155e+02 7.836e+02 1.180e+03 1.629e+03 3.232e+03, threshold=2.361e+03, percent-clipped=34.0 2023-06-27 22:12:14,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1908156.0, ans=0.125 2023-06-27 22:12:43,805 INFO [train.py:996] (2/4) Epoch 11, batch 13100, loss[loss=0.2501, simple_loss=0.3279, pruned_loss=0.08612, over 21801.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2914, pruned_loss=0.063, over 4279038.86 frames. 
], batch size: 441, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:12:51,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1908276.0, ans=0.125 2023-06-27 22:14:19,044 INFO [train.py:996] (2/4) Epoch 11, batch 13150, loss[loss=0.2367, simple_loss=0.3055, pruned_loss=0.08396, over 21391.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2957, pruned_loss=0.0656, over 4277000.61 frames. ], batch size: 471, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:14:22,234 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.348e+02 6.692e+02 9.537e+02 1.354e+03 2.505e+03, threshold=1.907e+03, percent-clipped=1.0 2023-06-27 22:14:57,932 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=15.0 2023-06-27 22:15:08,797 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=22.5 2023-06-27 22:15:13,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1908696.0, ans=0.5 2023-06-27 22:15:18,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1908696.0, ans=0.025 2023-06-27 22:16:06,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1908876.0, ans=0.0 2023-06-27 22:16:12,338 INFO [train.py:996] (2/4) Epoch 11, batch 13200, loss[loss=0.2107, simple_loss=0.2884, pruned_loss=0.06649, over 21814.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2973, pruned_loss=0.06539, over 4271810.25 frames. ], batch size: 282, lr: 2.66e-03, grad_scale: 32.0 2023-06-27 22:16:38,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1908936.0, ans=0.2 2023-06-27 22:17:49,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1909176.0, ans=0.1 2023-06-27 22:17:50,840 INFO [train.py:996] (2/4) Epoch 11, batch 13250, loss[loss=0.209, simple_loss=0.288, pruned_loss=0.06499, over 21724.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2975, pruned_loss=0.0675, over 4272964.93 frames. ], batch size: 247, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:17:55,792 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.059e+02 8.358e+02 1.341e+03 1.799e+03 2.954e+03, threshold=2.682e+03, percent-clipped=21.0 2023-06-27 22:18:15,360 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.71 vs. 
limit=10.0 2023-06-27 22:18:16,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1909236.0, ans=0.125 2023-06-27 22:18:19,843 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 22:18:44,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1909356.0, ans=0.0 2023-06-27 22:18:51,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1909356.0, ans=0.125 2023-06-27 22:19:34,355 INFO [train.py:996] (2/4) Epoch 11, batch 13300, loss[loss=0.2388, simple_loss=0.3223, pruned_loss=0.07761, over 21780.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2988, pruned_loss=0.06715, over 4272117.42 frames. ], batch size: 332, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:19:36,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1909476.0, ans=0.125 2023-06-27 22:19:50,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1909536.0, ans=0.125 2023-06-27 22:19:52,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1909536.0, ans=0.1 2023-06-27 22:19:55,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1909536.0, ans=0.125 2023-06-27 22:20:40,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1909656.0, ans=0.0 2023-06-27 22:20:57,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1909656.0, ans=0.125 2023-06-27 22:21:02,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1909716.0, ans=0.2 2023-06-27 22:21:04,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1909716.0, ans=0.125 2023-06-27 22:21:12,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1909716.0, ans=0.125 2023-06-27 22:21:18,950 INFO [train.py:996] (2/4) Epoch 11, batch 13350, loss[loss=0.1882, simple_loss=0.2882, pruned_loss=0.04406, over 20804.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.3023, pruned_loss=0.06942, over 4268962.08 frames. ], batch size: 609, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:21:23,888 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.586e+02 8.555e+02 1.217e+03 1.843e+03 4.034e+03, threshold=2.434e+03, percent-clipped=8.0 2023-06-27 22:21:29,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1909776.0, ans=0.0 2023-06-27 22:21:32,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1909776.0, ans=0.1 2023-06-27 22:21:44,960 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.61 vs. 
limit=15.0 2023-06-27 22:23:00,812 INFO [train.py:996] (2/4) Epoch 11, batch 13400, loss[loss=0.218, simple_loss=0.2983, pruned_loss=0.06885, over 21444.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3033, pruned_loss=0.07144, over 4276754.61 frames. ], batch size: 194, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:24:43,495 INFO [train.py:996] (2/4) Epoch 11, batch 13450, loss[loss=0.2027, simple_loss=0.275, pruned_loss=0.06518, over 21600.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3048, pruned_loss=0.07289, over 4271352.54 frames. ], batch size: 263, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:24:52,953 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.204e+02 6.376e+02 8.067e+02 1.099e+03 2.577e+03, threshold=1.613e+03, percent-clipped=1.0 2023-06-27 22:25:53,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1910556.0, ans=0.0 2023-06-27 22:26:31,893 INFO [train.py:996] (2/4) Epoch 11, batch 13500, loss[loss=0.195, simple_loss=0.272, pruned_loss=0.05899, over 21718.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2979, pruned_loss=0.0705, over 4258051.29 frames. ], batch size: 298, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:26:35,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1910676.0, ans=0.0 2023-06-27 22:27:40,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1910856.0, ans=0.1 2023-06-27 22:27:59,166 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=12.0 2023-06-27 22:28:11,316 INFO [train.py:996] (2/4) Epoch 11, batch 13550, loss[loss=0.2304, simple_loss=0.3273, pruned_loss=0.06673, over 21423.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.3002, pruned_loss=0.06914, over 4264748.37 frames. ], batch size: 211, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:28:13,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1910976.0, ans=0.0 2023-06-27 22:28:16,104 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.678e+02 8.007e+02 1.277e+03 1.961e+03 4.546e+03, threshold=2.554e+03, percent-clipped=33.0 2023-06-27 22:28:18,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1910976.0, ans=0.125 2023-06-27 22:28:45,385 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.67 vs. limit=8.0 2023-06-27 22:29:01,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1911096.0, ans=0.125 2023-06-27 22:29:01,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1911096.0, ans=0.125 2023-06-27 22:29:42,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1911216.0, ans=0.0 2023-06-27 22:29:53,214 INFO [train.py:996] (2/4) Epoch 11, batch 13600, loss[loss=0.2325, simple_loss=0.3031, pruned_loss=0.08097, over 21792.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2992, pruned_loss=0.06967, over 4269208.70 frames. 
], batch size: 441, lr: 2.66e-03, grad_scale: 32.0 2023-06-27 22:30:44,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1911396.0, ans=0.125 2023-06-27 22:30:44,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1911396.0, ans=0.0 2023-06-27 22:30:50,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1911396.0, ans=0.0 2023-06-27 22:31:32,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1911516.0, ans=15.0 2023-06-27 22:31:34,442 INFO [train.py:996] (2/4) Epoch 11, batch 13650, loss[loss=0.1746, simple_loss=0.2396, pruned_loss=0.0548, over 21629.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.295, pruned_loss=0.06727, over 4272303.17 frames. ], batch size: 231, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:31:37,434 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=15.0 2023-06-27 22:31:45,847 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.126e+02 5.497e+02 8.318e+02 1.364e+03 3.376e+03, threshold=1.664e+03, percent-clipped=5.0 2023-06-27 22:31:48,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1911576.0, ans=0.125 2023-06-27 22:32:45,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1911756.0, ans=0.0 2023-06-27 22:32:47,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1911756.0, ans=0.2 2023-06-27 22:33:08,122 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=15.0 2023-06-27 22:33:13,820 INFO [train.py:996] (2/4) Epoch 11, batch 13700, loss[loss=0.188, simple_loss=0.2584, pruned_loss=0.05877, over 21449.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2911, pruned_loss=0.06688, over 4277147.71 frames. ], batch size: 211, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:33:50,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1911936.0, ans=0.1 2023-06-27 22:34:43,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1912116.0, ans=0.0 2023-06-27 22:35:01,859 INFO [train.py:996] (2/4) Epoch 11, batch 13750, loss[loss=0.2159, simple_loss=0.2956, pruned_loss=0.06807, over 21626.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2897, pruned_loss=0.06658, over 4278888.72 frames. 
], batch size: 414, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:35:05,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1912176.0, ans=0.0 2023-06-27 22:35:13,335 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.554e+02 7.211e+02 1.142e+03 1.644e+03 3.975e+03, threshold=2.283e+03, percent-clipped=24.0 2023-06-27 22:35:15,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1912176.0, ans=0.125 2023-06-27 22:36:41,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1912416.0, ans=0.0 2023-06-27 22:36:47,742 INFO [train.py:996] (2/4) Epoch 11, batch 13800, loss[loss=0.1775, simple_loss=0.2949, pruned_loss=0.03003, over 19773.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2921, pruned_loss=0.06463, over 4274050.50 frames. ], batch size: 703, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:37:00,729 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 22:37:47,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1912656.0, ans=0.1 2023-06-27 22:37:50,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1912656.0, ans=0.0 2023-06-27 22:38:31,510 INFO [train.py:996] (2/4) Epoch 11, batch 13850, loss[loss=0.3265, simple_loss=0.3933, pruned_loss=0.1298, over 21480.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.3002, pruned_loss=0.06639, over 4273680.39 frames. ], batch size: 507, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:38:32,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1912776.0, ans=0.125 2023-06-27 22:38:35,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1912776.0, ans=10.0 2023-06-27 22:38:37,687 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.61 vs. limit=22.5 2023-06-27 22:38:38,141 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.536e+02 6.806e+02 9.243e+02 1.369e+03 3.206e+03, threshold=1.849e+03, percent-clipped=7.0 2023-06-27 22:38:38,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1912776.0, ans=0.125 2023-06-27 22:38:50,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1912776.0, ans=0.1 2023-06-27 22:39:04,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1912836.0, ans=0.0 2023-06-27 22:39:06,862 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-06-27 22:39:55,059 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.65 vs. 
limit=22.5 2023-06-27 22:40:12,063 INFO [train.py:996] (2/4) Epoch 11, batch 13900, loss[loss=0.2528, simple_loss=0.3292, pruned_loss=0.0882, over 21358.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.3039, pruned_loss=0.06926, over 4281091.01 frames. ], batch size: 143, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:41:13,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1913256.0, ans=0.0 2023-06-27 22:41:24,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1913256.0, ans=0.125 2023-06-27 22:41:36,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1913316.0, ans=0.125 2023-06-27 22:41:49,824 INFO [train.py:996] (2/4) Epoch 11, batch 13950, loss[loss=0.2094, simple_loss=0.2863, pruned_loss=0.06621, over 21774.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3043, pruned_loss=0.07045, over 4287914.63 frames. ], batch size: 282, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:42:00,988 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.608e+02 7.245e+02 1.112e+03 1.601e+03 2.924e+03, threshold=2.224e+03, percent-clipped=16.0 2023-06-27 22:42:14,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1913436.0, ans=0.125 2023-06-27 22:42:35,876 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=15.0 2023-06-27 22:43:29,853 INFO [train.py:996] (2/4) Epoch 11, batch 14000, loss[loss=0.199, simple_loss=0.2821, pruned_loss=0.05792, over 21418.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2996, pruned_loss=0.06851, over 4278089.55 frames. ], batch size: 548, lr: 2.66e-03, grad_scale: 32.0 2023-06-27 22:43:38,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1913676.0, ans=0.0 2023-06-27 22:44:23,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1913796.0, ans=0.125 2023-06-27 22:44:46,544 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=15.0 2023-06-27 22:44:49,566 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.60 vs. limit=15.0 2023-06-27 22:44:57,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1913916.0, ans=0.1 2023-06-27 22:45:05,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1913916.0, ans=0.0 2023-06-27 22:45:15,873 INFO [train.py:996] (2/4) Epoch 11, batch 14050, loss[loss=0.2043, simple_loss=0.2713, pruned_loss=0.06867, over 21584.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2944, pruned_loss=0.0653, over 4277623.33 frames. 
], batch size: 414, lr: 2.66e-03, grad_scale: 32.0 2023-06-27 22:45:22,438 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.282e+02 6.606e+02 1.013e+03 1.561e+03 3.162e+03, threshold=2.026e+03, percent-clipped=9.0 2023-06-27 22:45:22,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1913976.0, ans=0.0 2023-06-27 22:45:24,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1913976.0, ans=0.125 2023-06-27 22:45:40,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1914036.0, ans=0.0 2023-06-27 22:46:37,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1914216.0, ans=0.125 2023-06-27 22:46:57,096 INFO [train.py:996] (2/4) Epoch 11, batch 14100, loss[loss=0.1935, simple_loss=0.2645, pruned_loss=0.06122, over 21741.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2885, pruned_loss=0.06549, over 4264712.11 frames. ], batch size: 351, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:46:57,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1914276.0, ans=0.07 2023-06-27 22:47:41,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1914396.0, ans=0.1 2023-06-27 22:47:41,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1914396.0, ans=0.125 2023-06-27 22:47:44,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1914396.0, ans=0.07 2023-06-27 22:47:45,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1914396.0, ans=0.0 2023-06-27 22:47:57,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1914456.0, ans=0.0 2023-06-27 22:47:58,095 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.71 vs. limit=15.0 2023-06-27 22:48:31,949 INFO [train.py:996] (2/4) Epoch 11, batch 14150, loss[loss=0.2135, simple_loss=0.3054, pruned_loss=0.06082, over 21812.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2916, pruned_loss=0.06605, over 4262002.19 frames. ], batch size: 118, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:48:32,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1914576.0, ans=0.125 2023-06-27 22:48:44,404 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.197e+02 6.392e+02 8.340e+02 1.310e+03 2.692e+03, threshold=1.668e+03, percent-clipped=6.0 2023-06-27 22:49:02,998 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.71 vs. 
limit=12.0 2023-06-27 22:49:08,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1914696.0, ans=0.0 2023-06-27 22:49:16,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1914696.0, ans=0.0 2023-06-27 22:49:16,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1914696.0, ans=0.0 2023-06-27 22:50:10,691 INFO [train.py:996] (2/4) Epoch 11, batch 14200, loss[loss=0.2268, simple_loss=0.3001, pruned_loss=0.07677, over 21803.00 frames. ], tot_loss[loss=0.21, simple_loss=0.291, pruned_loss=0.06454, over 4268343.00 frames. ], batch size: 118, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:50:13,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1914876.0, ans=0.0 2023-06-27 22:50:33,204 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.90 vs. limit=10.0 2023-06-27 22:50:35,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1914936.0, ans=0.125 2023-06-27 22:50:43,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1914936.0, ans=0.09899494936611666 2023-06-27 22:50:53,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1914996.0, ans=0.125 2023-06-27 22:51:25,976 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.45 vs. limit=15.0 2023-06-27 22:51:50,595 INFO [train.py:996] (2/4) Epoch 11, batch 14250, loss[loss=0.1728, simple_loss=0.2493, pruned_loss=0.04818, over 21504.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.286, pruned_loss=0.06455, over 4269433.17 frames. ], batch size: 230, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:51:52,982 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 22:51:59,420 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.518e+02 7.474e+02 9.961e+02 1.736e+03 2.961e+03, threshold=1.992e+03, percent-clipped=26.0 2023-06-27 22:52:05,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1915176.0, ans=0.125 2023-06-27 22:52:07,853 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.31 vs. limit=15.0 2023-06-27 22:52:20,841 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.88 vs. limit=15.0 2023-06-27 22:52:32,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1915296.0, ans=0.125 2023-06-27 22:53:31,608 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. 
limit=15.0 2023-06-27 22:53:34,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1915476.0, ans=0.125 2023-06-27 22:53:35,417 INFO [train.py:996] (2/4) Epoch 11, batch 14300, loss[loss=0.3354, simple_loss=0.4202, pruned_loss=0.1253, over 21570.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2861, pruned_loss=0.06365, over 4256664.53 frames. ], batch size: 441, lr: 2.66e-03, grad_scale: 8.0 2023-06-27 22:53:46,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1915476.0, ans=0.1 2023-06-27 22:54:23,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1915596.0, ans=0.05 2023-06-27 22:55:05,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1915716.0, ans=0.09899494936611666 2023-06-27 22:55:09,724 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.61 vs. limit=10.0 2023-06-27 22:55:18,184 INFO [train.py:996] (2/4) Epoch 11, batch 14350, loss[loss=0.2024, simple_loss=0.2778, pruned_loss=0.06354, over 21756.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2912, pruned_loss=0.06478, over 4252181.41 frames. ], batch size: 112, lr: 2.66e-03, grad_scale: 8.0 2023-06-27 22:55:27,948 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.575e+02 7.291e+02 1.107e+03 2.245e+03 6.428e+03, threshold=2.214e+03, percent-clipped=28.0 2023-06-27 22:55:43,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1915836.0, ans=0.2 2023-06-27 22:55:58,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1915896.0, ans=0.2 2023-06-27 22:56:32,384 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 22:56:41,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1916016.0, ans=0.125 2023-06-27 22:56:46,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1916016.0, ans=0.0 2023-06-27 22:56:59,067 INFO [train.py:996] (2/4) Epoch 11, batch 14400, loss[loss=0.1878, simple_loss=0.256, pruned_loss=0.05979, over 21300.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2896, pruned_loss=0.066, over 4265304.83 frames. 
], batch size: 160, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:57:01,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1916076.0, ans=0.2 2023-06-27 22:57:03,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1916076.0, ans=0.1 2023-06-27 22:57:29,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1916136.0, ans=0.2 2023-06-27 22:57:40,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1916196.0, ans=0.2 2023-06-27 22:58:39,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1916376.0, ans=0.125 2023-06-27 22:58:40,446 INFO [train.py:996] (2/4) Epoch 11, batch 14450, loss[loss=0.2064, simple_loss=0.2745, pruned_loss=0.06914, over 21255.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2846, pruned_loss=0.06581, over 4266174.37 frames. ], batch size: 159, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:58:50,306 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.736e+02 6.824e+02 1.004e+03 1.771e+03 3.739e+03, threshold=2.008e+03, percent-clipped=15.0 2023-06-27 22:59:29,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1916496.0, ans=0.125 2023-06-27 23:00:21,626 INFO [train.py:996] (2/4) Epoch 11, batch 14500, loss[loss=0.1917, simple_loss=0.2684, pruned_loss=0.05754, over 21854.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2815, pruned_loss=0.06545, over 4275630.53 frames. ], batch size: 107, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 23:00:46,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1916736.0, ans=10.0 2023-06-27 23:01:15,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1916796.0, ans=0.125 2023-06-27 23:01:56,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1916916.0, ans=0.125 2023-06-27 23:02:04,604 INFO [train.py:996] (2/4) Epoch 11, batch 14550, loss[loss=0.259, simple_loss=0.341, pruned_loss=0.08851, over 21779.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2857, pruned_loss=0.06647, over 4275625.23 frames. 
], batch size: 124, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 23:02:14,903 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.488e+02 6.849e+02 9.219e+02 1.443e+03 4.541e+03, threshold=1.844e+03, percent-clipped=15.0 2023-06-27 23:03:05,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1917096.0, ans=0.125 2023-06-27 23:03:25,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1917156.0, ans=0.125 2023-06-27 23:03:31,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1917216.0, ans=0.125 2023-06-27 23:03:42,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1917216.0, ans=0.125 2023-06-27 23:03:48,466 INFO [train.py:996] (2/4) Epoch 11, batch 14600, loss[loss=0.2289, simple_loss=0.3155, pruned_loss=0.07113, over 21813.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2934, pruned_loss=0.06998, over 4276272.33 frames. ], batch size: 124, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:04:28,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1917336.0, ans=0.0 2023-06-27 23:05:31,382 INFO [train.py:996] (2/4) Epoch 11, batch 14650, loss[loss=0.1815, simple_loss=0.2482, pruned_loss=0.0574, over 20758.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2948, pruned_loss=0.06877, over 4279679.28 frames. ], batch size: 608, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:05:36,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1917576.0, ans=0.125 2023-06-27 23:05:41,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1917576.0, ans=0.125 2023-06-27 23:05:45,912 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.373e+02 8.214e+02 1.374e+03 1.981e+03 3.761e+03, threshold=2.748e+03, percent-clipped=28.0 2023-06-27 23:06:28,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1917696.0, ans=15.0 2023-06-27 23:06:39,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1917756.0, ans=0.125 2023-06-27 23:06:49,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1917756.0, ans=0.025 2023-06-27 23:06:55,414 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.25 vs. limit=12.0 2023-06-27 23:06:56,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1917816.0, ans=0.07 2023-06-27 23:07:19,687 INFO [train.py:996] (2/4) Epoch 11, batch 14700, loss[loss=0.2259, simple_loss=0.3192, pruned_loss=0.06633, over 21279.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2915, pruned_loss=0.06462, over 4276550.29 frames. 
], batch size: 548, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:07:22,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1917876.0, ans=0.125 2023-06-27 23:07:32,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1917876.0, ans=0.125 2023-06-27 23:07:50,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1917936.0, ans=0.2 2023-06-27 23:09:01,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1918116.0, ans=0.0 2023-06-27 23:09:04,343 INFO [train.py:996] (2/4) Epoch 11, batch 14750, loss[loss=0.2502, simple_loss=0.3274, pruned_loss=0.08647, over 21589.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.297, pruned_loss=0.06732, over 4278652.01 frames. ], batch size: 389, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:09:12,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1918176.0, ans=0.1 2023-06-27 23:09:14,865 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.274e+02 6.652e+02 9.504e+02 1.333e+03 3.432e+03, threshold=1.901e+03, percent-clipped=1.0 2023-06-27 23:09:54,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1918296.0, ans=0.2 2023-06-27 23:10:35,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1918416.0, ans=0.0 2023-06-27 23:10:48,844 INFO [train.py:996] (2/4) Epoch 11, batch 14800, loss[loss=0.2211, simple_loss=0.2834, pruned_loss=0.07944, over 21812.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3097, pruned_loss=0.07328, over 4280547.98 frames. ], batch size: 107, lr: 2.65e-03, grad_scale: 32.0 2023-06-27 23:11:09,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1918476.0, ans=0.1 2023-06-27 23:11:41,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1918596.0, ans=0.125 2023-06-27 23:12:12,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1918656.0, ans=0.1 2023-06-27 23:12:43,482 INFO [train.py:996] (2/4) Epoch 11, batch 14850, loss[loss=0.1828, simple_loss=0.2539, pruned_loss=0.05582, over 21787.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3032, pruned_loss=0.07283, over 4278609.33 frames. 
], batch size: 124, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:13:00,843 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.760e+02 8.785e+02 1.252e+03 1.775e+03 4.444e+03, threshold=2.503e+03, percent-clipped=22.0 2023-06-27 23:13:01,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1918776.0, ans=0.1 2023-06-27 23:13:06,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1918836.0, ans=0.0 2023-06-27 23:13:33,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1918896.0, ans=0.125 2023-06-27 23:14:32,373 INFO [train.py:996] (2/4) Epoch 11, batch 14900, loss[loss=0.2401, simple_loss=0.3097, pruned_loss=0.08522, over 21363.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3058, pruned_loss=0.07357, over 4278071.01 frames. ], batch size: 176, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:14:36,964 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.62 vs. limit=15.0 2023-06-27 23:15:07,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1919136.0, ans=0.07 2023-06-27 23:15:35,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1919256.0, ans=0.125 2023-06-27 23:15:58,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1919316.0, ans=0.0 2023-06-27 23:16:16,124 INFO [train.py:996] (2/4) Epoch 11, batch 14950, loss[loss=0.2161, simple_loss=0.2999, pruned_loss=0.06613, over 21801.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3063, pruned_loss=0.07266, over 4279162.74 frames. ], batch size: 282, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:16:27,767 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.435e+02 7.906e+02 1.198e+03 1.645e+03 4.202e+03, threshold=2.397e+03, percent-clipped=8.0 2023-06-27 23:16:41,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1919436.0, ans=0.0 2023-06-27 23:17:37,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=1919556.0, ans=0.1 2023-06-27 23:17:37,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1919556.0, ans=0.0 2023-06-27 23:17:52,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1919616.0, ans=0.125 2023-06-27 23:17:58,256 INFO [train.py:996] (2/4) Epoch 11, batch 15000, loss[loss=0.2296, simple_loss=0.2993, pruned_loss=0.07994, over 21298.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3085, pruned_loss=0.07383, over 4277662.95 frames. ], batch size: 159, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:17:58,256 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-27 23:18:18,456 INFO [train.py:1028] (2/4) Epoch 11, validation: loss=0.2534, simple_loss=0.3437, pruned_loss=0.08155, over 1796401.00 frames. 
2023-06-27 23:18:18,457 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23793MB 2023-06-27 23:19:38,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1919856.0, ans=0.04949747468305833 2023-06-27 23:19:50,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1919916.0, ans=0.125 2023-06-27 23:20:03,428 INFO [train.py:996] (2/4) Epoch 11, batch 15050, loss[loss=0.2722, simple_loss=0.3623, pruned_loss=0.09102, over 21537.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3083, pruned_loss=0.0744, over 4273133.40 frames. ], batch size: 471, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:20:04,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1919976.0, ans=0.125 2023-06-27 23:20:17,264 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.821e+02 6.928e+02 9.435e+02 1.433e+03 3.639e+03, threshold=1.887e+03, percent-clipped=3.0 2023-06-27 23:20:39,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1920036.0, ans=0.125 2023-06-27 23:21:06,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1920096.0, ans=0.125 2023-06-27 23:21:49,512 INFO [train.py:996] (2/4) Epoch 11, batch 15100, loss[loss=0.2007, simple_loss=0.2753, pruned_loss=0.06309, over 20837.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3085, pruned_loss=0.07285, over 4269173.02 frames. ], batch size: 608, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:21:50,663 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.67 vs. limit=10.0 2023-06-27 23:21:55,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1920276.0, ans=0.2 2023-06-27 23:22:45,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1920396.0, ans=0.1 2023-06-27 23:23:30,442 INFO [train.py:996] (2/4) Epoch 11, batch 15150, loss[loss=0.1909, simple_loss=0.2554, pruned_loss=0.06316, over 21323.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3046, pruned_loss=0.07304, over 4272696.12 frames. ], batch size: 211, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:23:31,473 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-27 23:23:46,526 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.704e+02 7.406e+02 1.033e+03 1.604e+03 3.709e+03, threshold=2.066e+03, percent-clipped=14.0 2023-06-27 23:24:05,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1920636.0, ans=0.125 2023-06-27 23:24:49,030 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.74 vs. 
limit=15.0 2023-06-27 23:24:56,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1920816.0, ans=0.125 2023-06-27 23:25:10,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1920816.0, ans=0.125 2023-06-27 23:25:12,946 INFO [train.py:996] (2/4) Epoch 11, batch 15200, loss[loss=0.1806, simple_loss=0.2659, pruned_loss=0.04761, over 21727.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2956, pruned_loss=0.06923, over 4271127.21 frames. ], batch size: 282, lr: 2.65e-03, grad_scale: 32.0 2023-06-27 23:25:40,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1920936.0, ans=0.09899494936611666 2023-06-27 23:26:08,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1920996.0, ans=0.2 2023-06-27 23:26:17,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1920996.0, ans=0.125 2023-06-27 23:27:01,185 INFO [train.py:996] (2/4) Epoch 11, batch 15250, loss[loss=0.1842, simple_loss=0.2622, pruned_loss=0.05311, over 21521.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2899, pruned_loss=0.06783, over 4257813.12 frames. ], batch size: 230, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:27:23,401 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.342e+02 7.886e+02 1.142e+03 1.659e+03 3.992e+03, threshold=2.285e+03, percent-clipped=18.0 2023-06-27 23:28:38,637 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 23:28:42,814 INFO [train.py:996] (2/4) Epoch 11, batch 15300, loss[loss=0.2406, simple_loss=0.3154, pruned_loss=0.0829, over 21675.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2911, pruned_loss=0.06923, over 4266153.76 frames. ], batch size: 351, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:30:29,668 INFO [train.py:996] (2/4) Epoch 11, batch 15350, loss[loss=0.2076, simple_loss=0.3015, pruned_loss=0.05683, over 21634.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.295, pruned_loss=0.0714, over 4265433.94 frames. ], batch size: 230, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:30:32,240 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.74 vs. limit=10.0 2023-06-27 23:30:47,341 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.758e+02 7.708e+02 1.113e+03 1.589e+03 3.642e+03, threshold=2.225e+03, percent-clipped=6.0 2023-06-27 23:30:57,115 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.87 vs. limit=15.0 2023-06-27 23:30:59,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1921836.0, ans=0.2 2023-06-27 23:31:31,032 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.26 vs. 
limit=15.0 2023-06-27 23:31:55,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1922016.0, ans=0.125 2023-06-27 23:32:05,690 INFO [train.py:996] (2/4) Epoch 11, batch 15400, loss[loss=0.2149, simple_loss=0.2919, pruned_loss=0.06891, over 21476.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2957, pruned_loss=0.06987, over 4261250.54 frames. ], batch size: 194, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:32:15,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1922076.0, ans=0.125 2023-06-27 23:32:32,510 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.57 vs. limit=6.0 2023-06-27 23:32:51,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1922196.0, ans=10.0 2023-06-27 23:33:23,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1922256.0, ans=0.125 2023-06-27 23:33:47,728 INFO [train.py:996] (2/4) Epoch 11, batch 15450, loss[loss=0.2035, simple_loss=0.2833, pruned_loss=0.06186, over 21333.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2945, pruned_loss=0.07003, over 4270476.07 frames. ], batch size: 159, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:33:57,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1922376.0, ans=0.04949747468305833 2023-06-27 23:34:10,725 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.367e+02 6.968e+02 9.606e+02 1.449e+03 2.613e+03, threshold=1.921e+03, percent-clipped=5.0 2023-06-27 23:34:46,587 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-27 23:35:34,387 INFO [train.py:996] (2/4) Epoch 11, batch 15500, loss[loss=0.2343, simple_loss=0.3125, pruned_loss=0.07803, over 21403.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2978, pruned_loss=0.06994, over 4244894.95 frames. ], batch size: 211, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:35:56,692 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=22.5 2023-06-27 23:36:12,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1922736.0, ans=0.1 2023-06-27 23:36:57,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1922916.0, ans=0.125 2023-06-27 23:37:11,771 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.40 vs. limit=22.5 2023-06-27 23:37:19,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1922916.0, ans=0.125 2023-06-27 23:37:21,929 INFO [train.py:996] (2/4) Epoch 11, batch 15550, loss[loss=0.1881, simple_loss=0.2621, pruned_loss=0.05705, over 21379.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2982, pruned_loss=0.068, over 4245914.94 frames. 
], batch size: 211, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:37:27,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1922976.0, ans=0.0 2023-06-27 23:37:34,972 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.488e+02 6.660e+02 9.717e+02 1.306e+03 2.635e+03, threshold=1.943e+03, percent-clipped=6.0 2023-06-27 23:37:50,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1923036.0, ans=0.125 2023-06-27 23:38:33,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1923156.0, ans=0.125 2023-06-27 23:39:03,936 INFO [train.py:996] (2/4) Epoch 11, batch 15600, loss[loss=0.2056, simple_loss=0.2997, pruned_loss=0.05579, over 21255.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2933, pruned_loss=0.06669, over 4242667.36 frames. ], batch size: 548, lr: 2.65e-03, grad_scale: 32.0 2023-06-27 23:39:14,725 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.11 vs. limit=10.0 2023-06-27 23:39:16,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1923276.0, ans=0.1 2023-06-27 23:39:40,860 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.51 vs. limit=15.0 2023-06-27 23:40:18,552 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.43 vs. limit=15.0 2023-06-27 23:40:20,307 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=15.0 2023-06-27 23:40:42,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1923516.0, ans=0.125 2023-06-27 23:40:45,236 INFO [train.py:996] (2/4) Epoch 11, batch 15650, loss[loss=0.1757, simple_loss=0.2409, pruned_loss=0.05522, over 15438.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2913, pruned_loss=0.06652, over 4234771.59 frames. ], batch size: 61, lr: 2.65e-03, grad_scale: 32.0 2023-06-27 23:41:03,340 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.415e+02 8.795e+02 1.290e+03 1.896e+03 3.786e+03, threshold=2.580e+03, percent-clipped=24.0 2023-06-27 23:42:00,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1923816.0, ans=0.2 2023-06-27 23:42:27,307 INFO [train.py:996] (2/4) Epoch 11, batch 15700, loss[loss=0.2019, simple_loss=0.2675, pruned_loss=0.06816, over 21175.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.287, pruned_loss=0.06585, over 4239881.30 frames. ], batch size: 159, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:42:41,043 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 23:42:54,879 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.73 vs. 
limit=15.0 2023-06-27 23:43:02,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1923936.0, ans=0.0 2023-06-27 23:43:13,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1923996.0, ans=0.2 2023-06-27 23:43:50,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1924116.0, ans=0.015 2023-06-27 23:44:08,201 INFO [train.py:996] (2/4) Epoch 11, batch 15750, loss[loss=0.1931, simple_loss=0.2563, pruned_loss=0.06497, over 21849.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2826, pruned_loss=0.0657, over 4249491.84 frames. ], batch size: 98, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:44:27,434 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.257e+02 5.974e+02 8.242e+02 1.132e+03 2.648e+03, threshold=1.648e+03, percent-clipped=1.0 2023-06-27 23:44:46,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1924296.0, ans=0.125 2023-06-27 23:45:49,089 INFO [train.py:996] (2/4) Epoch 11, batch 15800, loss[loss=0.2045, simple_loss=0.2787, pruned_loss=0.0652, over 16524.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2779, pruned_loss=0.06516, over 4231085.87 frames. ], batch size: 60, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:46:06,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1924476.0, ans=0.05 2023-06-27 23:46:29,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1924596.0, ans=0.025 2023-06-27 23:47:32,292 INFO [train.py:996] (2/4) Epoch 11, batch 15850, loss[loss=0.1841, simple_loss=0.2455, pruned_loss=0.06132, over 21518.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2804, pruned_loss=0.06703, over 4246063.07 frames. ], batch size: 212, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:47:47,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1924776.0, ans=0.125 2023-06-27 23:47:52,240 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.544e+02 6.551e+02 9.403e+02 1.336e+03 2.589e+03, threshold=1.881e+03, percent-clipped=10.0 2023-06-27 23:48:45,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1924956.0, ans=0.125 2023-06-27 23:49:15,200 INFO [train.py:996] (2/4) Epoch 11, batch 15900, loss[loss=0.2353, simple_loss=0.3095, pruned_loss=0.08059, over 21399.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2788, pruned_loss=0.06741, over 4242972.77 frames. ], batch size: 211, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:49:33,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1925076.0, ans=0.0 2023-06-27 23:49:37,253 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.34 vs. limit=22.5 2023-06-27 23:50:46,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1925316.0, ans=0.125 2023-06-27 23:50:57,575 INFO [train.py:996] (2/4) Epoch 11, batch 15950, loss[loss=0.1923, simple_loss=0.3037, pruned_loss=0.04041, over 21247.00 frames. 
], tot_loss[loss=0.2072, simple_loss=0.282, pruned_loss=0.06619, over 4245375.64 frames. ], batch size: 548, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:51:17,433 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.351e+02 7.372e+02 1.063e+03 1.688e+03 3.100e+03, threshold=2.125e+03, percent-clipped=16.0 2023-06-27 23:51:29,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1925436.0, ans=0.2 2023-06-27 23:51:47,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1925496.0, ans=15.0 2023-06-27 23:52:21,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1925616.0, ans=0.04949747468305833 2023-06-27 23:52:40,050 INFO [train.py:996] (2/4) Epoch 11, batch 16000, loss[loss=0.2461, simple_loss=0.3364, pruned_loss=0.07793, over 21514.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2834, pruned_loss=0.06454, over 4248240.23 frames. ], batch size: 471, lr: 2.65e-03, grad_scale: 32.0 2023-06-27 23:52:50,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1925676.0, ans=0.125 2023-06-27 23:53:21,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1925796.0, ans=0.1 2023-06-27 23:53:41,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1925856.0, ans=0.125 2023-06-27 23:53:42,605 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.85 vs. limit=10.0 2023-06-27 23:53:47,878 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=12.0 2023-06-27 23:53:56,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1925916.0, ans=0.125 2023-06-27 23:54:01,016 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=22.5 2023-06-27 23:54:10,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1925916.0, ans=0.125 2023-06-27 23:54:17,617 INFO [train.py:996] (2/4) Epoch 11, batch 16050, loss[loss=0.2145, simple_loss=0.3097, pruned_loss=0.05966, over 21748.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2864, pruned_loss=0.0629, over 4247585.08 frames. 
], batch size: 247, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:54:37,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1925976.0, ans=0.035 2023-06-27 23:54:37,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1925976.0, ans=0.0 2023-06-27 23:54:43,012 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.281e+02 6.057e+02 9.389e+02 1.429e+03 3.235e+03, threshold=1.878e+03, percent-clipped=6.0 2023-06-27 23:54:43,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1926036.0, ans=0.0 2023-06-27 23:54:46,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1926036.0, ans=0.0 2023-06-27 23:54:59,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1926096.0, ans=0.125 2023-06-27 23:55:56,794 INFO [train.py:996] (2/4) Epoch 11, batch 16100, loss[loss=0.2362, simple_loss=0.316, pruned_loss=0.07824, over 21803.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2912, pruned_loss=0.06447, over 4258807.89 frames. ], batch size: 112, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:55:57,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1926276.0, ans=0.0 2023-06-27 23:56:21,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1926336.0, ans=0.125 2023-06-27 23:56:44,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1926396.0, ans=0.125 2023-06-27 23:56:56,460 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-06-27 23:57:37,663 INFO [train.py:996] (2/4) Epoch 11, batch 16150, loss[loss=0.2124, simple_loss=0.2738, pruned_loss=0.07551, over 21583.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.292, pruned_loss=0.06637, over 4267713.94 frames. ], batch size: 548, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:58:03,761 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.374e+02 7.210e+02 1.100e+03 1.545e+03 2.941e+03, threshold=2.200e+03, percent-clipped=14.0 2023-06-27 23:58:27,017 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-27 23:58:37,505 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.66 vs. limit=22.5 2023-06-27 23:59:19,452 INFO [train.py:996] (2/4) Epoch 11, batch 16200, loss[loss=0.2477, simple_loss=0.3281, pruned_loss=0.08368, over 21267.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2947, pruned_loss=0.06694, over 4270065.34 frames. ], batch size: 143, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:59:36,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1926876.0, ans=0.0 2023-06-27 23:59:56,769 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.09 vs. 
limit=15.0 2023-06-27 23:59:59,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1926936.0, ans=0.125 2023-06-28 00:00:01,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1926996.0, ans=0.5 2023-06-28 00:00:34,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1927056.0, ans=0.125 2023-06-28 00:01:06,255 INFO [train.py:996] (2/4) Epoch 11, batch 16250, loss[loss=0.1955, simple_loss=0.2736, pruned_loss=0.05869, over 21454.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2945, pruned_loss=0.06705, over 4272489.67 frames. ], batch size: 194, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:01:27,652 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.408e+02 8.189e+02 1.172e+03 1.830e+03 4.029e+03, threshold=2.343e+03, percent-clipped=14.0 2023-06-28 00:01:35,568 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=15.0 2023-06-28 00:01:54,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1927296.0, ans=0.0 2023-06-28 00:02:11,952 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.53 vs. limit=10.0 2023-06-28 00:02:15,381 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-06-28 00:02:53,057 INFO [train.py:996] (2/4) Epoch 11, batch 16300, loss[loss=0.1695, simple_loss=0.256, pruned_loss=0.0415, over 21428.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2881, pruned_loss=0.06368, over 4268952.81 frames. ], batch size: 211, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:03:05,875 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.65 vs. limit=6.0 2023-06-28 00:03:25,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1927596.0, ans=0.125 2023-06-28 00:04:28,222 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.20 vs. limit=22.5 2023-06-28 00:04:37,037 INFO [train.py:996] (2/4) Epoch 11, batch 16350, loss[loss=0.2786, simple_loss=0.3614, pruned_loss=0.09787, over 21814.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2883, pruned_loss=0.06454, over 4271897.58 frames. ], batch size: 118, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:04:41,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1927776.0, ans=0.04949747468305833 2023-06-28 00:04:41,483 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.91 vs. 
limit=10.0 2023-06-28 00:04:47,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1927776.0, ans=0.1 2023-06-28 00:04:53,548 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.626e+02 5.951e+02 8.785e+02 1.347e+03 2.273e+03, threshold=1.757e+03, percent-clipped=0.0 2023-06-28 00:05:12,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1927836.0, ans=0.0 2023-06-28 00:05:45,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1927956.0, ans=0.1 2023-06-28 00:06:10,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1928016.0, ans=0.1 2023-06-28 00:06:15,120 INFO [train.py:996] (2/4) Epoch 11, batch 16400, loss[loss=0.2381, simple_loss=0.3086, pruned_loss=0.08381, over 21805.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.291, pruned_loss=0.06545, over 4278720.01 frames. ], batch size: 414, lr: 2.65e-03, grad_scale: 32.0 2023-06-28 00:06:27,876 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.63 vs. limit=15.0 2023-06-28 00:07:08,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1928256.0, ans=0.125 2023-06-28 00:07:08,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1928256.0, ans=0.125 2023-06-28 00:07:27,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1928256.0, ans=0.0 2023-06-28 00:07:54,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1928316.0, ans=0.125 2023-06-28 00:07:56,945 INFO [train.py:996] (2/4) Epoch 11, batch 16450, loss[loss=0.2254, simple_loss=0.3125, pruned_loss=0.06921, over 21415.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2902, pruned_loss=0.06549, over 4284273.14 frames. ], batch size: 548, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:08:16,240 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.401e+02 6.665e+02 9.796e+02 1.595e+03 2.942e+03, threshold=1.959e+03, percent-clipped=15.0 2023-06-28 00:08:33,963 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.19 vs. limit=22.5 2023-06-28 00:08:37,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1928496.0, ans=0.125 2023-06-28 00:09:41,724 INFO [train.py:996] (2/4) Epoch 11, batch 16500, loss[loss=0.2568, simple_loss=0.3376, pruned_loss=0.088, over 21506.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2886, pruned_loss=0.06611, over 4276552.10 frames. ], batch size: 471, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:09:48,153 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.88 vs. limit=6.0 2023-06-28 00:10:13,312 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. 
limit=15.0 2023-06-28 00:10:17,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1928736.0, ans=0.125 2023-06-28 00:10:32,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1928796.0, ans=0.125 2023-06-28 00:10:56,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1928856.0, ans=0.125 2023-06-28 00:11:18,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1928916.0, ans=0.0 2023-06-28 00:11:26,477 INFO [train.py:996] (2/4) Epoch 11, batch 16550, loss[loss=0.2089, simple_loss=0.2936, pruned_loss=0.0621, over 21752.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2898, pruned_loss=0.06512, over 4275130.03 frames. ], batch size: 298, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:11:36,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1928976.0, ans=0.07 2023-06-28 00:11:42,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1928976.0, ans=0.0 2023-06-28 00:11:44,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1928976.0, ans=0.125 2023-06-28 00:11:48,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1929036.0, ans=0.2 2023-06-28 00:11:50,017 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.361e+02 7.345e+02 1.277e+03 1.917e+03 4.181e+03, threshold=2.555e+03, percent-clipped=23.0 2023-06-28 00:11:52,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1929036.0, ans=0.0 2023-06-28 00:12:56,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1929216.0, ans=0.125 2023-06-28 00:13:04,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1929216.0, ans=0.125 2023-06-28 00:13:07,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1929216.0, ans=0.0 2023-06-28 00:13:15,360 INFO [train.py:996] (2/4) Epoch 11, batch 16600, loss[loss=0.2579, simple_loss=0.3562, pruned_loss=0.07973, over 21743.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2964, pruned_loss=0.0674, over 4272628.82 frames. ], batch size: 247, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:13:35,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1929336.0, ans=0.0 2023-06-28 00:13:41,400 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.19 vs. limit=12.0 2023-06-28 00:13:47,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1929336.0, ans=0.1 2023-06-28 00:14:42,219 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-28 00:15:00,043 INFO [train.py:996] (2/4) Epoch 11, batch 16650, loss[loss=0.1939, simple_loss=0.315, pruned_loss=0.03644, over 20858.00 frames. 
], tot_loss[loss=0.2229, simple_loss=0.3064, pruned_loss=0.06964, over 4268370.31 frames. ], batch size: 608, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:15:00,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1929576.0, ans=0.05 2023-06-28 00:15:28,779 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.761e+02 8.064e+02 1.116e+03 1.585e+03 3.216e+03, threshold=2.231e+03, percent-clipped=5.0 2023-06-28 00:16:03,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1929696.0, ans=0.0 2023-06-28 00:16:18,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1929756.0, ans=0.125 2023-06-28 00:16:50,168 INFO [train.py:996] (2/4) Epoch 11, batch 16700, loss[loss=0.1875, simple_loss=0.2517, pruned_loss=0.06166, over 21378.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3069, pruned_loss=0.07106, over 4271955.72 frames. ], batch size: 131, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:17:22,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1929936.0, ans=0.0 2023-06-28 00:17:39,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1929996.0, ans=0.125 2023-06-28 00:18:14,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1930056.0, ans=0.0 2023-06-28 00:18:46,805 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-28 00:18:47,088 INFO [train.py:996] (2/4) Epoch 11, batch 16750, loss[loss=0.2176, simple_loss=0.3104, pruned_loss=0.06244, over 21848.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3091, pruned_loss=0.07278, over 4273895.27 frames. ], batch size: 282, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:18:47,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1930176.0, ans=0.1 2023-06-28 00:19:12,037 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.860e+02 6.994e+02 8.979e+02 1.342e+03 3.526e+03, threshold=1.796e+03, percent-clipped=9.0 2023-06-28 00:20:13,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1930356.0, ans=0.1 2023-06-28 00:20:37,337 INFO [train.py:996] (2/4) Epoch 11, batch 16800, loss[loss=0.2312, simple_loss=0.3077, pruned_loss=0.07739, over 21876.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3131, pruned_loss=0.07264, over 4273209.89 frames. ], batch size: 351, lr: 2.65e-03, grad_scale: 32.0 2023-06-28 00:20:38,607 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=15.0 2023-06-28 00:21:01,369 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.93 vs. 
limit=10.0 2023-06-28 00:21:17,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1930596.0, ans=0.1 2023-06-28 00:21:54,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1930656.0, ans=0.125 2023-06-28 00:22:14,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1930716.0, ans=0.0 2023-06-28 00:22:18,738 INFO [train.py:996] (2/4) Epoch 11, batch 16850, loss[loss=0.2118, simple_loss=0.2873, pruned_loss=0.06813, over 21932.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.31, pruned_loss=0.0726, over 4277236.95 frames. ], batch size: 333, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:22:24,655 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=15.0 2023-06-28 00:22:25,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1930776.0, ans=0.125 2023-06-28 00:22:31,607 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=15.0 2023-06-28 00:22:38,614 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.042e+02 8.354e+02 1.397e+03 2.191e+03 5.653e+03, threshold=2.793e+03, percent-clipped=35.0 2023-06-28 00:23:06,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1930896.0, ans=0.125 2023-06-28 00:23:33,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1930956.0, ans=0.1 2023-06-28 00:24:00,838 INFO [train.py:996] (2/4) Epoch 11, batch 16900, loss[loss=0.1826, simple_loss=0.2561, pruned_loss=0.05454, over 21584.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3027, pruned_loss=0.07114, over 4279091.68 frames. ], batch size: 263, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:24:01,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1931076.0, ans=0.125 2023-06-28 00:24:11,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1931076.0, ans=0.0 2023-06-28 00:24:19,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1931136.0, ans=0.125 2023-06-28 00:24:56,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1931196.0, ans=0.05 2023-06-28 00:25:20,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1931316.0, ans=0.0 2023-06-28 00:25:28,776 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 00:25:41,094 INFO [train.py:996] (2/4) Epoch 11, batch 16950, loss[loss=0.2017, simple_loss=0.2718, pruned_loss=0.06584, over 21619.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2947, pruned_loss=0.06868, over 4281224.88 frames. 
], batch size: 195, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:26:00,773 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.390e+02 6.361e+02 9.262e+02 1.143e+03 1.974e+03, threshold=1.852e+03, percent-clipped=0.0 2023-06-28 00:26:10,693 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-06-28 00:26:24,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1931496.0, ans=0.125 2023-06-28 00:26:45,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1931556.0, ans=0.0 2023-06-28 00:27:17,063 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.93 vs. limit=15.0 2023-06-28 00:27:22,676 INFO [train.py:996] (2/4) Epoch 11, batch 17000, loss[loss=0.2143, simple_loss=0.2906, pruned_loss=0.06905, over 21841.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2909, pruned_loss=0.06858, over 4289396.74 frames. ], batch size: 124, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:27:28,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1931676.0, ans=0.5 2023-06-28 00:27:36,168 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-06-28 00:28:02,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1931796.0, ans=0.1 2023-06-28 00:28:33,021 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.19 vs. limit=15.0 2023-06-28 00:28:33,091 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.59 vs. limit=22.5 2023-06-28 00:29:06,155 INFO [train.py:996] (2/4) Epoch 11, batch 17050, loss[loss=0.2442, simple_loss=0.328, pruned_loss=0.08015, over 21837.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2979, pruned_loss=0.07068, over 4288470.65 frames. ], batch size: 332, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:29:26,237 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.012e+02 8.433e+02 1.501e+03 2.176e+03 5.028e+03, threshold=3.003e+03, percent-clipped=35.0 2023-06-28 00:29:41,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1932036.0, ans=0.125 2023-06-28 00:29:49,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1932096.0, ans=0.0 2023-06-28 00:30:22,025 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.43 vs. limit=15.0 2023-06-28 00:30:32,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1932216.0, ans=0.1 2023-06-28 00:30:46,891 INFO [train.py:996] (2/4) Epoch 11, batch 17100, loss[loss=0.2035, simple_loss=0.2772, pruned_loss=0.06485, over 21878.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2974, pruned_loss=0.07147, over 4289085.12 frames. 
], batch size: 332, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:31:02,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1932336.0, ans=0.1 2023-06-28 00:31:04,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1932336.0, ans=0.0 2023-06-28 00:31:07,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1932336.0, ans=0.0 2023-06-28 00:31:19,306 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 00:32:01,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1932456.0, ans=0.1 2023-06-28 00:32:27,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1932516.0, ans=0.125 2023-06-28 00:32:29,947 INFO [train.py:996] (2/4) Epoch 11, batch 17150, loss[loss=0.1957, simple_loss=0.2859, pruned_loss=0.05281, over 21709.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2939, pruned_loss=0.07083, over 4297195.04 frames. ], batch size: 414, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:32:30,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1932576.0, ans=0.0 2023-06-28 00:32:35,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1932576.0, ans=0.1 2023-06-28 00:32:54,502 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.748e+02 5.716e+02 7.652e+02 9.791e+02 2.028e+03, threshold=1.530e+03, percent-clipped=0.0 2023-06-28 00:33:21,725 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=15.0 2023-06-28 00:33:51,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1932756.0, ans=0.0 2023-06-28 00:33:58,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1932816.0, ans=0.0 2023-06-28 00:33:59,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1932816.0, ans=0.125 2023-06-28 00:34:16,988 INFO [train.py:996] (2/4) Epoch 11, batch 17200, loss[loss=0.3059, simple_loss=0.3556, pruned_loss=0.1281, over 21345.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2934, pruned_loss=0.0704, over 4292724.07 frames. ], batch size: 507, lr: 2.64e-03, grad_scale: 32.0 2023-06-28 00:34:17,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1932876.0, ans=0.125 2023-06-28 00:34:22,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1932876.0, ans=0.0 2023-06-28 00:35:09,609 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.38 vs. 
limit=12.0 2023-06-28 00:35:35,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1933056.0, ans=0.0 2023-06-28 00:35:43,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1933116.0, ans=0.0 2023-06-28 00:36:00,770 INFO [train.py:996] (2/4) Epoch 11, batch 17250, loss[loss=0.2407, simple_loss=0.317, pruned_loss=0.08217, over 21583.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2978, pruned_loss=0.07208, over 4284615.50 frames. ], batch size: 389, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:36:32,859 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.884e+02 8.366e+02 1.182e+03 1.787e+03 4.360e+03, threshold=2.365e+03, percent-clipped=31.0 2023-06-28 00:36:39,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1933236.0, ans=0.125 2023-06-28 00:37:26,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1933416.0, ans=0.125 2023-06-28 00:37:28,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1933416.0, ans=0.0 2023-06-28 00:37:30,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1933416.0, ans=0.125 2023-06-28 00:37:30,703 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=15.0 2023-06-28 00:37:31,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1933416.0, ans=0.125 2023-06-28 00:37:49,426 INFO [train.py:996] (2/4) Epoch 11, batch 17300, loss[loss=0.2678, simple_loss=0.3351, pruned_loss=0.1002, over 21294.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.306, pruned_loss=0.07574, over 4281083.83 frames. ], batch size: 176, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:37:51,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1933476.0, ans=0.125 2023-06-28 00:38:30,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1933536.0, ans=0.125 2023-06-28 00:39:40,382 INFO [train.py:996] (2/4) Epoch 11, batch 17350, loss[loss=0.1977, simple_loss=0.2875, pruned_loss=0.05396, over 19885.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3065, pruned_loss=0.07525, over 4279338.80 frames. ], batch size: 702, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:39:59,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1933776.0, ans=0.125 2023-06-28 00:40:07,464 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.114e+02 8.299e+02 1.147e+03 1.835e+03 3.555e+03, threshold=2.294e+03, percent-clipped=8.0 2023-06-28 00:40:15,332 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.70 vs. 
limit=15.0 2023-06-28 00:40:18,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1933896.0, ans=0.1 2023-06-28 00:40:20,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1933896.0, ans=0.125 2023-06-28 00:40:21,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1933896.0, ans=0.125 2023-06-28 00:41:11,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1934016.0, ans=0.0 2023-06-28 00:41:25,744 INFO [train.py:996] (2/4) Epoch 11, batch 17400, loss[loss=0.2042, simple_loss=0.2919, pruned_loss=0.05823, over 21767.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3026, pruned_loss=0.07161, over 4277937.66 frames. ], batch size: 332, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:41:55,369 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.22 vs. limit=15.0 2023-06-28 00:41:58,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1934136.0, ans=0.125 2023-06-28 00:42:25,475 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.42 vs. limit=10.0 2023-06-28 00:42:52,225 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-06-28 00:43:13,927 INFO [train.py:996] (2/4) Epoch 11, batch 17450, loss[loss=0.1698, simple_loss=0.2477, pruned_loss=0.04591, over 21433.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.298, pruned_loss=0.06937, over 4278054.35 frames. ], batch size: 194, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:43:22,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1934376.0, ans=0.125 2023-06-28 00:43:26,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1934376.0, ans=0.0 2023-06-28 00:43:41,605 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.482e+02 8.576e+02 1.354e+03 2.024e+03 4.305e+03, threshold=2.708e+03, percent-clipped=16.0 2023-06-28 00:43:42,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1934436.0, ans=0.1 2023-06-28 00:44:08,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1934496.0, ans=10.0 2023-06-28 00:44:55,336 INFO [train.py:996] (2/4) Epoch 11, batch 17500, loss[loss=0.1985, simple_loss=0.2712, pruned_loss=0.0629, over 21633.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2939, pruned_loss=0.06666, over 4275223.08 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 8.0 2023-06-28 00:45:32,320 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.44 vs. 
limit=15.0 2023-06-28 00:46:20,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1934916.0, ans=0.125 2023-06-28 00:46:35,451 INFO [train.py:996] (2/4) Epoch 11, batch 17550, loss[loss=0.199, simple_loss=0.2938, pruned_loss=0.05213, over 21626.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2939, pruned_loss=0.06564, over 4273771.69 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 8.0 2023-06-28 00:47:02,918 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.317e+02 6.340e+02 7.775e+02 1.102e+03 1.869e+03, threshold=1.555e+03, percent-clipped=0.0 2023-06-28 00:47:13,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1935096.0, ans=0.0 2023-06-28 00:47:14,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1935096.0, ans=0.125 2023-06-28 00:47:36,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1935156.0, ans=0.0 2023-06-28 00:47:59,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1935216.0, ans=10.0 2023-06-28 00:48:14,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1935216.0, ans=0.125 2023-06-28 00:48:16,979 INFO [train.py:996] (2/4) Epoch 11, batch 17600, loss[loss=0.2234, simple_loss=0.3054, pruned_loss=0.07074, over 21342.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2971, pruned_loss=0.0662, over 4274801.72 frames. ], batch size: 548, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:49:16,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1935456.0, ans=0.125 2023-06-28 00:49:36,881 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.07 vs. limit=10.0 2023-06-28 00:50:01,080 INFO [train.py:996] (2/4) Epoch 11, batch 17650, loss[loss=0.1896, simple_loss=0.2596, pruned_loss=0.0598, over 21429.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2969, pruned_loss=0.06665, over 4265155.28 frames. ], batch size: 131, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:50:03,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1935576.0, ans=0.125 2023-06-28 00:50:08,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1935576.0, ans=0.0 2023-06-28 00:50:29,624 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.907e+02 7.332e+02 1.084e+03 1.896e+03 3.594e+03, threshold=2.168e+03, percent-clipped=34.0 2023-06-28 00:50:50,052 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=15.0 2023-06-28 00:51:37,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1935816.0, ans=0.1 2023-06-28 00:51:49,575 INFO [train.py:996] (2/4) Epoch 11, batch 17700, loss[loss=0.2106, simple_loss=0.2988, pruned_loss=0.06123, over 21601.00 frames. 
], tot_loss[loss=0.2099, simple_loss=0.2903, pruned_loss=0.06476, over 4257124.69 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:51:53,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1935876.0, ans=0.2 2023-06-28 00:52:19,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1935936.0, ans=0.125 2023-06-28 00:52:27,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1935996.0, ans=0.125 2023-06-28 00:52:33,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=1935996.0, ans=22.5 2023-06-28 00:53:07,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1936056.0, ans=0.0 2023-06-28 00:53:09,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1936056.0, ans=0.125 2023-06-28 00:53:30,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1936116.0, ans=0.125 2023-06-28 00:53:33,360 INFO [train.py:996] (2/4) Epoch 11, batch 17750, loss[loss=0.2377, simple_loss=0.3166, pruned_loss=0.07942, over 21412.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2983, pruned_loss=0.06815, over 4265439.85 frames. ], batch size: 549, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:53:50,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1936176.0, ans=0.015 2023-06-28 00:53:56,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1936236.0, ans=0.1 2023-06-28 00:54:01,415 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.937e+02 7.178e+02 1.077e+03 1.520e+03 3.336e+03, threshold=2.154e+03, percent-clipped=9.0 2023-06-28 00:54:57,682 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=15.0 2023-06-28 00:55:09,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1936416.0, ans=0.1 2023-06-28 00:55:22,110 INFO [train.py:996] (2/4) Epoch 11, batch 17800, loss[loss=0.2188, simple_loss=0.3057, pruned_loss=0.06597, over 21707.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2981, pruned_loss=0.06793, over 4264752.72 frames. ], batch size: 351, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:55:22,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1936476.0, ans=0.0 2023-06-28 00:55:42,864 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. 
limit=6.0 2023-06-28 00:56:26,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1936656.0, ans=0.0 2023-06-28 00:56:31,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1936656.0, ans=0.125 2023-06-28 00:57:02,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1936716.0, ans=0.125 2023-06-28 00:57:03,584 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.70 vs. limit=15.0 2023-06-28 00:57:05,743 INFO [train.py:996] (2/4) Epoch 11, batch 17850, loss[loss=0.2255, simple_loss=0.2972, pruned_loss=0.07696, over 21615.00 frames. ], tot_loss[loss=0.219, simple_loss=0.3001, pruned_loss=0.06894, over 4261435.62 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:57:34,256 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.504e+02 7.242e+02 1.057e+03 1.582e+03 3.438e+03, threshold=2.115e+03, percent-clipped=9.0 2023-06-28 00:57:51,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1936896.0, ans=0.0 2023-06-28 00:58:16,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1936956.0, ans=0.125 2023-06-28 00:58:48,609 INFO [train.py:996] (2/4) Epoch 11, batch 17900, loss[loss=0.2161, simple_loss=0.3169, pruned_loss=0.05764, over 21745.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3037, pruned_loss=0.06969, over 4266281.99 frames. ], batch size: 298, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:59:02,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1937076.0, ans=0.125 2023-06-28 00:59:17,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1937136.0, ans=0.125 2023-06-28 00:59:22,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1937136.0, ans=0.125 2023-06-28 00:59:52,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1937196.0, ans=0.1 2023-06-28 01:00:10,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1937256.0, ans=0.1 2023-06-28 01:00:37,334 INFO [train.py:996] (2/4) Epoch 11, batch 17950, loss[loss=0.2086, simple_loss=0.3136, pruned_loss=0.05179, over 21209.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.303, pruned_loss=0.06617, over 4266079.69 frames. 
], batch size: 548, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:00:37,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1937376.0, ans=0.125 2023-06-28 01:01:09,591 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.263e+02 6.938e+02 9.459e+02 1.364e+03 3.127e+03, threshold=1.892e+03, percent-clipped=7.0 2023-06-28 01:01:10,192 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 01:01:13,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1937436.0, ans=0.0 2023-06-28 01:01:43,254 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-28 01:01:45,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1937556.0, ans=0.0 2023-06-28 01:01:49,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1937556.0, ans=0.0 2023-06-28 01:02:20,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1937616.0, ans=0.1 2023-06-28 01:02:22,724 INFO [train.py:996] (2/4) Epoch 11, batch 18000, loss[loss=0.2173, simple_loss=0.2842, pruned_loss=0.07525, over 21991.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2956, pruned_loss=0.06492, over 4262922.40 frames. ], batch size: 103, lr: 2.64e-03, grad_scale: 32.0 2023-06-28 01:02:22,724 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-28 01:02:39,152 INFO [train.py:1028] (2/4) Epoch 11, validation: loss=0.2572, simple_loss=0.3509, pruned_loss=0.08176, over 1796401.00 frames. 2023-06-28 01:02:39,152 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23793MB 2023-06-28 01:02:46,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1937676.0, ans=0.05 2023-06-28 01:03:01,056 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.58 vs. limit=10.0 2023-06-28 01:03:23,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1937796.0, ans=0.0 2023-06-28 01:03:28,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1937796.0, ans=0.125 2023-06-28 01:04:22,696 INFO [train.py:996] (2/4) Epoch 11, batch 18050, loss[loss=0.1927, simple_loss=0.2623, pruned_loss=0.06153, over 21610.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2899, pruned_loss=0.06418, over 4262910.29 frames. 
], batch size: 415, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:04:25,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1937976.0, ans=0.0 2023-06-28 01:04:58,010 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.600e+02 6.639e+02 9.648e+02 1.453e+03 3.276e+03, threshold=1.930e+03, percent-clipped=8.0 2023-06-28 01:05:08,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1938096.0, ans=0.1 2023-06-28 01:06:10,713 INFO [train.py:996] (2/4) Epoch 11, batch 18100, loss[loss=0.2284, simple_loss=0.3006, pruned_loss=0.07814, over 21818.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2946, pruned_loss=0.06663, over 4264424.66 frames. ], batch size: 102, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:06:14,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1938276.0, ans=0.125 2023-06-28 01:06:27,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1938276.0, ans=0.0 2023-06-28 01:06:59,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1938396.0, ans=0.125 2023-06-28 01:07:13,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1938456.0, ans=0.125 2023-06-28 01:07:48,927 INFO [train.py:996] (2/4) Epoch 11, batch 18150, loss[loss=0.1774, simple_loss=0.2565, pruned_loss=0.04915, over 21529.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2963, pruned_loss=0.06609, over 4270391.22 frames. ], batch size: 230, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:08:18,398 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.434e+02 6.385e+02 9.174e+02 1.252e+03 3.670e+03, threshold=1.835e+03, percent-clipped=3.0 2023-06-28 01:08:49,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1938756.0, ans=0.0 2023-06-28 01:09:10,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1938816.0, ans=0.0 2023-06-28 01:09:24,174 INFO [train.py:996] (2/4) Epoch 11, batch 18200, loss[loss=0.2208, simple_loss=0.285, pruned_loss=0.07829, over 21589.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2902, pruned_loss=0.06607, over 4262989.15 frames. ], batch size: 415, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:09:59,053 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.66 vs. limit=10.0 2023-06-28 01:10:54,091 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.71 vs. limit=22.5 2023-06-28 01:11:04,700 INFO [train.py:996] (2/4) Epoch 11, batch 18250, loss[loss=0.1566, simple_loss=0.2339, pruned_loss=0.03963, over 21546.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2829, pruned_loss=0.06405, over 4255934.50 frames. 
], batch size: 230, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:11:37,984 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.242e+02 6.955e+02 1.102e+03 1.552e+03 2.927e+03, threshold=2.205e+03, percent-clipped=10.0 2023-06-28 01:11:38,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1939236.0, ans=0.0 2023-06-28 01:11:41,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1939236.0, ans=0.125 2023-06-28 01:11:56,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1939296.0, ans=0.125 2023-06-28 01:12:17,212 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 01:12:46,289 INFO [train.py:996] (2/4) Epoch 11, batch 18300, loss[loss=0.216, simple_loss=0.3237, pruned_loss=0.05419, over 21647.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2848, pruned_loss=0.06488, over 4266414.60 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:13:33,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1939596.0, ans=0.125 2023-06-28 01:14:22,408 INFO [train.py:996] (2/4) Epoch 11, batch 18350, loss[loss=0.1927, simple_loss=0.2706, pruned_loss=0.05743, over 21544.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2898, pruned_loss=0.06451, over 4266887.59 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:14:36,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1939776.0, ans=0.1 2023-06-28 01:14:36,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1939776.0, ans=0.125 2023-06-28 01:14:56,376 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.730e+02 6.827e+02 1.100e+03 1.659e+03 4.791e+03, threshold=2.200e+03, percent-clipped=14.0 2023-06-28 01:15:00,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1939836.0, ans=0.1 2023-06-28 01:15:30,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1939956.0, ans=0.0 2023-06-28 01:15:54,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1940016.0, ans=0.0 2023-06-28 01:16:05,025 INFO [train.py:996] (2/4) Epoch 11, batch 18400, loss[loss=0.214, simple_loss=0.3013, pruned_loss=0.06336, over 21743.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2864, pruned_loss=0.06322, over 4271851.33 frames. ], batch size: 371, lr: 2.64e-03, grad_scale: 32.0 2023-06-28 01:16:23,616 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 01:16:48,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1940196.0, ans=0.0 2023-06-28 01:17:25,712 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.53 vs. limit=15.0 2023-06-28 01:17:37,795 INFO [train.py:996] (2/4) Epoch 11, batch 18450, loss[loss=0.1652, simple_loss=0.233, pruned_loss=0.04871, over 21748.00 frames. 
], tot_loss[loss=0.2017, simple_loss=0.2834, pruned_loss=0.06001, over 4271715.63 frames. ], batch size: 112, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:18:14,204 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.159e+02 6.017e+02 7.931e+02 1.267e+03 3.301e+03, threshold=1.586e+03, percent-clipped=3.0 2023-06-28 01:18:41,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1940556.0, ans=0.0 2023-06-28 01:19:15,638 INFO [train.py:996] (2/4) Epoch 11, batch 18500, loss[loss=0.2455, simple_loss=0.3211, pruned_loss=0.08496, over 21503.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2792, pruned_loss=0.05963, over 4278437.01 frames. ], batch size: 508, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:19:22,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1940676.0, ans=0.015 2023-06-28 01:19:55,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1940736.0, ans=0.0 2023-06-28 01:19:58,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1940796.0, ans=0.125 2023-06-28 01:20:08,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1940796.0, ans=0.2 2023-06-28 01:20:15,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1940856.0, ans=0.125 2023-06-28 01:20:54,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1940916.0, ans=0.125 2023-06-28 01:20:57,750 INFO [train.py:996] (2/4) Epoch 11, batch 18550, loss[loss=0.1982, simple_loss=0.2681, pruned_loss=0.06419, over 21490.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.2765, pruned_loss=0.05861, over 4282917.04 frames. ], batch size: 132, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:21:24,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1941036.0, ans=0.0 2023-06-28 01:21:32,022 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.25 vs. limit=10.0 2023-06-28 01:21:34,190 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.234e+02 6.100e+02 9.556e+02 1.452e+03 3.261e+03, threshold=1.911e+03, percent-clipped=19.0 2023-06-28 01:21:36,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=1941036.0, ans=0.1 2023-06-28 01:21:43,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1941096.0, ans=0.0 2023-06-28 01:22:45,325 INFO [train.py:996] (2/4) Epoch 11, batch 18600, loss[loss=0.3038, simple_loss=0.37, pruned_loss=0.1187, over 21531.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2756, pruned_loss=0.05968, over 4280209.44 frames. ], batch size: 509, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:23:21,296 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.76 vs. limit=22.5 2023-06-28 01:24:26,405 INFO [train.py:996] (2/4) Epoch 11, batch 18650, loss[loss=0.1879, simple_loss=0.2562, pruned_loss=0.05976, over 20024.00 frames. 
], tot_loss[loss=0.1972, simple_loss=0.2741, pruned_loss=0.06012, over 4265070.61 frames. ], batch size: 703, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:24:32,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1941576.0, ans=0.125 2023-06-28 01:24:38,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1941576.0, ans=0.0 2023-06-28 01:24:52,414 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.322e+02 7.479e+02 1.141e+03 1.737e+03 3.586e+03, threshold=2.283e+03, percent-clipped=19.0 2023-06-28 01:25:57,779 INFO [train.py:996] (2/4) Epoch 11, batch 18700, loss[loss=0.1942, simple_loss=0.2447, pruned_loss=0.07186, over 20412.00 frames. ], tot_loss[loss=0.1965, simple_loss=0.2716, pruned_loss=0.0607, over 4267557.20 frames. ], batch size: 703, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:26:36,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1941936.0, ans=0.2 2023-06-28 01:27:40,762 INFO [train.py:996] (2/4) Epoch 11, batch 18750, loss[loss=0.1872, simple_loss=0.2587, pruned_loss=0.0578, over 21496.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2721, pruned_loss=0.06222, over 4269857.41 frames. ], batch size: 212, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:27:49,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1942176.0, ans=0.0 2023-06-28 01:27:54,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1942176.0, ans=0.125 2023-06-28 01:28:12,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1942236.0, ans=0.125 2023-06-28 01:28:17,086 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.499e+02 6.198e+02 1.010e+03 1.418e+03 2.835e+03, threshold=2.020e+03, percent-clipped=5.0 2023-06-28 01:28:19,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1942236.0, ans=0.125 2023-06-28 01:28:20,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1942236.0, ans=0.125 2023-06-28 01:28:50,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1942356.0, ans=0.125 2023-06-28 01:29:23,220 INFO [train.py:996] (2/4) Epoch 11, batch 18800, loss[loss=0.2127, simple_loss=0.3092, pruned_loss=0.05809, over 21661.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2773, pruned_loss=0.0632, over 4260019.92 frames. 
], batch size: 441, lr: 2.64e-03, grad_scale: 32.0 2023-06-28 01:30:04,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1942596.0, ans=0.125 2023-06-28 01:30:30,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1942656.0, ans=0.125 2023-06-28 01:30:33,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1942656.0, ans=0.125 2023-06-28 01:30:55,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1942716.0, ans=0.0 2023-06-28 01:30:56,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1942716.0, ans=0.0 2023-06-28 01:31:04,466 INFO [train.py:996] (2/4) Epoch 11, batch 18850, loss[loss=0.1927, simple_loss=0.2704, pruned_loss=0.05746, over 21609.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.2743, pruned_loss=0.0597, over 4252436.41 frames. ], batch size: 391, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:31:23,098 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=15.0 2023-06-28 01:31:35,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1942836.0, ans=0.1 2023-06-28 01:31:41,972 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.145e+02 6.934e+02 1.004e+03 1.636e+03 4.618e+03, threshold=2.007e+03, percent-clipped=13.0 2023-06-28 01:32:46,431 INFO [train.py:996] (2/4) Epoch 11, batch 18900, loss[loss=0.1915, simple_loss=0.2416, pruned_loss=0.07073, over 20220.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2704, pruned_loss=0.05921, over 4256108.33 frames. ], batch size: 702, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:32:47,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1943076.0, ans=0.125 2023-06-28 01:32:48,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1943076.0, ans=0.125 2023-06-28 01:32:49,472 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=12.0 2023-06-28 01:33:40,477 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.50 vs. limit=6.0 2023-06-28 01:34:17,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1943316.0, ans=0.125 2023-06-28 01:34:28,589 INFO [train.py:996] (2/4) Epoch 11, batch 18950, loss[loss=0.1793, simple_loss=0.2463, pruned_loss=0.0562, over 21454.00 frames. ], tot_loss[loss=0.1972, simple_loss=0.2723, pruned_loss=0.06106, over 4266781.16 frames. 
], batch size: 212, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:35:07,395 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.266e+02 7.363e+02 1.116e+03 1.715e+03 3.795e+03, threshold=2.232e+03, percent-clipped=17.0 2023-06-28 01:35:14,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1943496.0, ans=0.125 2023-06-28 01:35:31,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1943496.0, ans=0.09899494936611666 2023-06-28 01:35:52,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1943556.0, ans=0.1 2023-06-28 01:36:16,480 INFO [train.py:996] (2/4) Epoch 11, batch 19000, loss[loss=0.2493, simple_loss=0.3255, pruned_loss=0.08656, over 21685.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2819, pruned_loss=0.06244, over 4273724.14 frames. ], batch size: 231, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:36:31,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1943676.0, ans=0.1 2023-06-28 01:36:39,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1943736.0, ans=10.0 2023-06-28 01:36:55,220 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2023-06-28 01:37:28,800 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2023-06-28 01:37:31,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1943856.0, ans=0.09899494936611666 2023-06-28 01:37:35,499 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.26 vs. limit=22.5 2023-06-28 01:37:59,364 INFO [train.py:996] (2/4) Epoch 11, batch 19050, loss[loss=0.218, simple_loss=0.2835, pruned_loss=0.07631, over 21478.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2868, pruned_loss=0.06618, over 4277485.32 frames. ], batch size: 194, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:38:00,618 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.99 vs. limit=22.5 2023-06-28 01:38:34,345 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.763e+02 7.359e+02 1.013e+03 1.496e+03 3.084e+03, threshold=2.026e+03, percent-clipped=8.0 2023-06-28 01:39:43,728 INFO [train.py:996] (2/4) Epoch 11, batch 19100, loss[loss=0.2093, simple_loss=0.2694, pruned_loss=0.0746, over 21473.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2851, pruned_loss=0.06697, over 4287165.80 frames. 
], batch size: 441, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:39:48,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1944276.0, ans=0.0 2023-06-28 01:40:40,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1944396.0, ans=0.125 2023-06-28 01:41:18,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1944516.0, ans=0.125 2023-06-28 01:41:25,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1944516.0, ans=0.0 2023-06-28 01:41:33,379 INFO [train.py:996] (2/4) Epoch 11, batch 19150, loss[loss=0.2837, simple_loss=0.3798, pruned_loss=0.09382, over 21667.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2912, pruned_loss=0.06876, over 4278455.38 frames. ], batch size: 414, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:42:09,610 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.832e+02 7.660e+02 1.202e+03 2.015e+03 4.043e+03, threshold=2.404e+03, percent-clipped=23.0 2023-06-28 01:42:39,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1944756.0, ans=0.125 2023-06-28 01:42:44,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1944756.0, ans=0.125 2023-06-28 01:43:15,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1944816.0, ans=0.125 2023-06-28 01:43:19,389 INFO [train.py:996] (2/4) Epoch 11, batch 19200, loss[loss=0.2922, simple_loss=0.3857, pruned_loss=0.09932, over 21636.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.3023, pruned_loss=0.07027, over 4275950.92 frames. ], batch size: 441, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:44:12,893 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-06-28 01:44:36,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1945056.0, ans=0.0 2023-06-28 01:44:44,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1945116.0, ans=0.125 2023-06-28 01:44:49,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1945116.0, ans=0.0 2023-06-28 01:45:01,780 INFO [train.py:996] (2/4) Epoch 11, batch 19250, loss[loss=0.1824, simple_loss=0.2713, pruned_loss=0.04671, over 21834.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2999, pruned_loss=0.06495, over 4282505.59 frames. 
], batch size: 282, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:45:08,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1945176.0, ans=0.035 2023-06-28 01:45:36,161 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.112e+02 6.434e+02 9.084e+02 1.292e+03 2.942e+03, threshold=1.817e+03, percent-clipped=2.0 2023-06-28 01:46:15,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1945356.0, ans=0.0 2023-06-28 01:46:27,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1945416.0, ans=0.125 2023-06-28 01:46:32,648 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=15.0 2023-06-28 01:46:43,082 INFO [train.py:996] (2/4) Epoch 11, batch 19300, loss[loss=0.1995, simple_loss=0.2809, pruned_loss=0.05903, over 21817.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2962, pruned_loss=0.06394, over 4290218.93 frames. ], batch size: 351, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:46:53,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1945476.0, ans=0.1 2023-06-28 01:47:15,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1945536.0, ans=0.2 2023-06-28 01:47:15,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1945536.0, ans=0.0 2023-06-28 01:48:08,818 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.06 vs. limit=15.0 2023-06-28 01:48:25,861 INFO [train.py:996] (2/4) Epoch 11, batch 19350, loss[loss=0.1915, simple_loss=0.2782, pruned_loss=0.05237, over 21646.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2914, pruned_loss=0.06062, over 4280969.01 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:48:52,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1945836.0, ans=0.125 2023-06-28 01:49:06,755 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.646e+02 6.544e+02 1.045e+03 1.616e+03 2.621e+03, threshold=2.089e+03, percent-clipped=15.0 2023-06-28 01:49:55,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1946016.0, ans=0.0 2023-06-28 01:50:06,771 INFO [train.py:996] (2/4) Epoch 11, batch 19400, loss[loss=0.2168, simple_loss=0.2945, pruned_loss=0.06953, over 21417.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2898, pruned_loss=0.06008, over 4285348.05 frames. ], batch size: 131, lr: 2.64e-03, grad_scale: 8.0 2023-06-28 01:50:49,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1946196.0, ans=0.0 2023-06-28 01:50:49,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1946196.0, ans=0.2 2023-06-28 01:51:31,418 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.74 vs. 
limit=6.0 2023-06-28 01:51:48,635 INFO [train.py:996] (2/4) Epoch 11, batch 19450, loss[loss=0.2195, simple_loss=0.2771, pruned_loss=0.08093, over 21593.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2865, pruned_loss=0.06176, over 4286635.10 frames. ], batch size: 441, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 01:52:16,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1946436.0, ans=0.125 2023-06-28 01:52:21,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1946436.0, ans=0.0 2023-06-28 01:52:30,297 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.692e+02 7.227e+02 1.148e+03 1.482e+03 2.916e+03, threshold=2.296e+03, percent-clipped=8.0 2023-06-28 01:52:34,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1946496.0, ans=0.125 2023-06-28 01:53:32,680 INFO [train.py:996] (2/4) Epoch 11, batch 19500, loss[loss=0.1941, simple_loss=0.2731, pruned_loss=0.05758, over 20721.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2822, pruned_loss=0.06258, over 4284452.31 frames. ], batch size: 607, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 01:54:07,793 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=15.0 2023-06-28 01:54:57,684 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=22.5 2023-06-28 01:55:00,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1946916.0, ans=0.025 2023-06-28 01:55:16,434 INFO [train.py:996] (2/4) Epoch 11, batch 19550, loss[loss=0.1691, simple_loss=0.2586, pruned_loss=0.03978, over 21825.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2777, pruned_loss=0.06125, over 4276187.03 frames. ], batch size: 316, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 01:55:57,083 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.189e+02 6.305e+02 9.070e+02 1.284e+03 2.823e+03, threshold=1.814e+03, percent-clipped=4.0 2023-06-28 01:56:05,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1947096.0, ans=0.2 2023-06-28 01:56:17,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1947096.0, ans=0.125 2023-06-28 01:56:26,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1947156.0, ans=0.125 2023-06-28 01:56:31,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1947156.0, ans=0.0 2023-06-28 01:56:57,987 INFO [train.py:996] (2/4) Epoch 11, batch 19600, loss[loss=0.2456, simple_loss=0.3223, pruned_loss=0.0845, over 21481.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2795, pruned_loss=0.0619, over 4280823.92 frames. 
], batch size: 131, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 01:57:13,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1947276.0, ans=0.1 2023-06-28 01:57:49,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1947396.0, ans=0.1 2023-06-28 01:58:24,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1947516.0, ans=0.0 2023-06-28 01:58:35,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1947516.0, ans=0.0 2023-06-28 01:58:43,063 INFO [train.py:996] (2/4) Epoch 11, batch 19650, loss[loss=0.2443, simple_loss=0.3254, pruned_loss=0.08162, over 21845.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2842, pruned_loss=0.06469, over 4277055.03 frames. ], batch size: 118, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 01:58:50,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1947576.0, ans=0.2 2023-06-28 01:59:23,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1947636.0, ans=0.0 2023-06-28 01:59:29,704 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.082e+02 7.409e+02 1.104e+03 1.587e+03 3.520e+03, threshold=2.207e+03, percent-clipped=14.0 2023-06-28 02:00:32,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1947816.0, ans=15.0 2023-06-28 02:00:39,365 INFO [train.py:996] (2/4) Epoch 11, batch 19700, loss[loss=0.2615, simple_loss=0.3433, pruned_loss=0.08978, over 21513.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2872, pruned_loss=0.06557, over 4278687.30 frames. ], batch size: 508, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:01:49,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1948056.0, ans=0.0 2023-06-28 02:02:18,163 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.06 vs. limit=15.0 2023-06-28 02:02:28,057 INFO [train.py:996] (2/4) Epoch 11, batch 19750, loss[loss=0.2206, simple_loss=0.3094, pruned_loss=0.06588, over 21284.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2968, pruned_loss=0.06711, over 4272292.46 frames. ], batch size: 159, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:02:59,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1948236.0, ans=0.0 2023-06-28 02:03:04,771 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.869e+02 8.060e+02 1.121e+03 1.722e+03 5.088e+03, threshold=2.243e+03, percent-clipped=14.0 2023-06-28 02:03:40,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1948356.0, ans=0.04949747468305833 2023-06-28 02:03:47,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1948416.0, ans=0.0 2023-06-28 02:04:10,955 INFO [train.py:996] (2/4) Epoch 11, batch 19800, loss[loss=0.2023, simple_loss=0.2699, pruned_loss=0.06734, over 21806.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2951, pruned_loss=0.0671, over 4270595.31 frames. 
], batch size: 124, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:04:27,283 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0 2023-06-28 02:04:44,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1948536.0, ans=0.0 2023-06-28 02:05:00,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1948596.0, ans=0.2 2023-06-28 02:05:08,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1948596.0, ans=0.2 2023-06-28 02:05:08,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1948596.0, ans=0.125 2023-06-28 02:05:25,352 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.01 vs. limit=15.0 2023-06-28 02:05:57,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1948716.0, ans=0.1 2023-06-28 02:06:00,846 INFO [train.py:996] (2/4) Epoch 11, batch 19850, loss[loss=0.1712, simple_loss=0.2396, pruned_loss=0.0514, over 21411.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2879, pruned_loss=0.06298, over 4271055.31 frames. ], batch size: 131, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:06:01,648 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 02:06:23,277 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 02:06:32,739 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.738e+02 8.105e+02 1.255e+03 1.783e+03 2.882e+03, threshold=2.510e+03, percent-clipped=10.0 2023-06-28 02:06:34,044 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=15.0 2023-06-28 02:07:39,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1949016.0, ans=0.0 2023-06-28 02:07:42,434 INFO [train.py:996] (2/4) Epoch 11, batch 19900, loss[loss=0.2046, simple_loss=0.2773, pruned_loss=0.06594, over 21122.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2894, pruned_loss=0.0613, over 4261892.78 frames. ], batch size: 159, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:08:09,892 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.68 vs. 
limit=15.0 2023-06-28 02:08:16,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1949196.0, ans=0.125 2023-06-28 02:08:17,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1949196.0, ans=0.2 2023-06-28 02:08:35,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1949256.0, ans=0.125 2023-06-28 02:09:19,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1949316.0, ans=0.125 2023-06-28 02:09:25,677 INFO [train.py:996] (2/4) Epoch 11, batch 19950, loss[loss=0.2038, simple_loss=0.2727, pruned_loss=0.06746, over 21763.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2832, pruned_loss=0.06043, over 4258701.70 frames. ], batch size: 102, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:09:51,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1949436.0, ans=0.125 2023-06-28 02:09:58,011 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.624e+02 6.449e+02 8.969e+02 1.295e+03 2.845e+03, threshold=1.794e+03, percent-clipped=2.0 2023-06-28 02:10:11,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1949496.0, ans=0.2 2023-06-28 02:10:30,015 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.35 vs. limit=15.0 2023-06-28 02:11:07,755 INFO [train.py:996] (2/4) Epoch 11, batch 20000, loss[loss=0.2361, simple_loss=0.3052, pruned_loss=0.08343, over 21751.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2829, pruned_loss=0.06109, over 4258537.99 frames. ], batch size: 441, lr: 2.63e-03, grad_scale: 32.0 2023-06-28 02:11:25,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1949736.0, ans=0.1 2023-06-28 02:11:55,331 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=22.5 2023-06-28 02:12:49,282 INFO [train.py:996] (2/4) Epoch 11, batch 20050, loss[loss=0.2268, simple_loss=0.2973, pruned_loss=0.07818, over 21728.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2851, pruned_loss=0.06315, over 4272300.13 frames. ], batch size: 389, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:13:27,885 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.048e+02 6.719e+02 1.022e+03 1.464e+03 2.848e+03, threshold=2.043e+03, percent-clipped=12.0 2023-06-28 02:13:36,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1950096.0, ans=0.1 2023-06-28 02:14:25,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1950216.0, ans=0.125 2023-06-28 02:14:26,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1950216.0, ans=0.0 2023-06-28 02:14:33,061 INFO [train.py:996] (2/4) Epoch 11, batch 20100, loss[loss=0.2055, simple_loss=0.283, pruned_loss=0.06399, over 21118.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2867, pruned_loss=0.06473, over 4281220.19 frames. 
], batch size: 143, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:14:33,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1950276.0, ans=0.09899494936611666 2023-06-28 02:14:39,294 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0 2023-06-28 02:14:47,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1950276.0, ans=0.125 2023-06-28 02:15:22,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1950396.0, ans=0.125 2023-06-28 02:15:53,674 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.16 vs. limit=22.5 2023-06-28 02:16:10,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1950516.0, ans=0.0 2023-06-28 02:16:16,910 INFO [train.py:996] (2/4) Epoch 11, batch 20150, loss[loss=0.2472, simple_loss=0.3265, pruned_loss=0.08392, over 21691.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2966, pruned_loss=0.06853, over 4283575.15 frames. ], batch size: 351, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:16:26,439 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.24 vs. limit=15.0 2023-06-28 02:16:27,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1950576.0, ans=0.0 2023-06-28 02:17:06,273 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.168e+02 7.352e+02 1.035e+03 1.689e+03 3.687e+03, threshold=2.071e+03, percent-clipped=15.0 2023-06-28 02:17:18,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1950696.0, ans=0.1 2023-06-28 02:17:58,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1950816.0, ans=0.1 2023-06-28 02:18:07,648 INFO [train.py:996] (2/4) Epoch 11, batch 20200, loss[loss=0.2475, simple_loss=0.3755, pruned_loss=0.05976, over 19868.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3036, pruned_loss=0.07178, over 4275631.36 frames. ], batch size: 702, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:18:08,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1950876.0, ans=0.125 2023-06-28 02:19:19,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1951056.0, ans=0.125 2023-06-28 02:19:38,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1951116.0, ans=0.0 2023-06-28 02:19:51,072 INFO [train.py:996] (2/4) Epoch 11, batch 20250, loss[loss=0.1883, simple_loss=0.2739, pruned_loss=0.05133, over 21427.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3036, pruned_loss=0.07068, over 4275764.46 frames. 
], batch size: 211, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:19:53,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1951176.0, ans=0.2 2023-06-28 02:20:22,690 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=15.0 2023-06-28 02:20:33,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1951236.0, ans=0.0 2023-06-28 02:20:39,494 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.229e+02 6.229e+02 9.670e+02 1.265e+03 2.835e+03, threshold=1.934e+03, percent-clipped=7.0 2023-06-28 02:21:13,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1951356.0, ans=0.0 2023-06-28 02:21:19,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1951416.0, ans=0.125 2023-06-28 02:21:25,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1951416.0, ans=0.125 2023-06-28 02:21:37,853 INFO [train.py:996] (2/4) Epoch 11, batch 20300, loss[loss=0.2062, simple_loss=0.3115, pruned_loss=0.05042, over 20868.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.301, pruned_loss=0.0671, over 4265030.76 frames. ], batch size: 608, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:21:53,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1951476.0, ans=0.0 2023-06-28 02:21:56,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1951476.0, ans=0.125 2023-06-28 02:22:57,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1951716.0, ans=0.125 2023-06-28 02:23:12,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1951776.0, ans=0.1 2023-06-28 02:23:13,332 INFO [train.py:996] (2/4) Epoch 11, batch 20350, loss[loss=0.1953, simple_loss=0.2821, pruned_loss=0.05423, over 21866.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2992, pruned_loss=0.06637, over 4246731.04 frames. ], batch size: 102, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:23:17,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1951776.0, ans=0.04949747468305833 2023-06-28 02:23:28,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1951776.0, ans=0.0 2023-06-28 02:23:46,965 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.74 vs. 
limit=15.0 2023-06-28 02:24:01,023 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.709e+02 6.451e+02 8.868e+02 1.412e+03 2.811e+03, threshold=1.774e+03, percent-clipped=7.0 2023-06-28 02:24:09,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1951896.0, ans=0.1 2023-06-28 02:24:11,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1951896.0, ans=0.125 2023-06-28 02:24:47,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1952016.0, ans=0.125 2023-06-28 02:24:56,171 INFO [train.py:996] (2/4) Epoch 11, batch 20400, loss[loss=0.2564, simple_loss=0.3275, pruned_loss=0.09263, over 21271.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.3007, pruned_loss=0.06826, over 4236882.28 frames. ], batch size: 159, lr: 2.63e-03, grad_scale: 32.0 2023-06-28 02:25:08,466 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.24 vs. limit=22.5 2023-06-28 02:25:15,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1952076.0, ans=0.2 2023-06-28 02:25:19,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1952076.0, ans=0.0 2023-06-28 02:25:20,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1952136.0, ans=0.0 2023-06-28 02:25:40,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1952136.0, ans=0.125 2023-06-28 02:25:50,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1952196.0, ans=0.125 2023-06-28 02:26:03,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1952256.0, ans=0.2 2023-06-28 02:26:37,014 INFO [train.py:996] (2/4) Epoch 11, batch 20450, loss[loss=0.2167, simple_loss=0.2941, pruned_loss=0.06965, over 21493.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3028, pruned_loss=0.07105, over 4240404.75 frames. ], batch size: 131, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:27:24,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1952496.0, ans=0.125 2023-06-28 02:27:25,095 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.694e+02 8.138e+02 1.140e+03 1.534e+03 2.680e+03, threshold=2.280e+03, percent-clipped=12.0 2023-06-28 02:27:54,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1952556.0, ans=0.125 2023-06-28 02:28:13,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1952616.0, ans=0.1 2023-06-28 02:28:15,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1952616.0, ans=0.125 2023-06-28 02:28:17,758 INFO [train.py:996] (2/4) Epoch 11, batch 20500, loss[loss=0.225, simple_loss=0.2926, pruned_loss=0.0787, over 21725.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2989, pruned_loss=0.0715, over 4256181.01 frames. 
], batch size: 414, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:28:53,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1952736.0, ans=0.035 2023-06-28 02:29:02,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1952796.0, ans=0.1 2023-06-28 02:29:14,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1952796.0, ans=0.0 2023-06-28 02:30:04,138 INFO [train.py:996] (2/4) Epoch 11, batch 20550, loss[loss=0.2428, simple_loss=0.3545, pruned_loss=0.06554, over 19831.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2919, pruned_loss=0.07007, over 4258858.52 frames. ], batch size: 703, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:30:16,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1952976.0, ans=0.125 2023-06-28 02:30:43,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1953036.0, ans=0.0 2023-06-28 02:30:44,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1953036.0, ans=0.125 2023-06-28 02:30:49,265 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.859e+02 7.218e+02 1.038e+03 1.367e+03 4.804e+03, threshold=2.077e+03, percent-clipped=4.0 2023-06-28 02:30:56,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1953096.0, ans=0.125 2023-06-28 02:31:31,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1953216.0, ans=0.0 2023-06-28 02:31:42,649 INFO [train.py:996] (2/4) Epoch 11, batch 20600, loss[loss=0.2161, simple_loss=0.2947, pruned_loss=0.06877, over 21840.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2956, pruned_loss=0.06904, over 4265125.06 frames. ], batch size: 332, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:32:25,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1953336.0, ans=0.1 2023-06-28 02:32:34,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1953396.0, ans=0.125 2023-06-28 02:33:01,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1953456.0, ans=0.125 2023-06-28 02:33:04,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1953456.0, ans=0.0 2023-06-28 02:33:28,458 INFO [train.py:996] (2/4) Epoch 11, batch 20650, loss[loss=0.1877, simple_loss=0.2399, pruned_loss=0.0678, over 20218.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2911, pruned_loss=0.06882, over 4266088.44 frames. ], batch size: 703, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:33:37,506 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.59 vs. 
limit=15.0 2023-06-28 02:33:55,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1953636.0, ans=0.125 2023-06-28 02:33:56,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1953636.0, ans=0.015 2023-06-28 02:34:12,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1953696.0, ans=0.5 2023-06-28 02:34:13,040 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.450e+02 6.069e+02 8.420e+02 1.112e+03 2.688e+03, threshold=1.684e+03, percent-clipped=4.0 2023-06-28 02:35:04,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1953816.0, ans=0.125 2023-06-28 02:35:11,584 INFO [train.py:996] (2/4) Epoch 11, batch 20700, loss[loss=0.1776, simple_loss=0.2508, pruned_loss=0.05223, over 21265.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2831, pruned_loss=0.06538, over 4252611.54 frames. ], batch size: 176, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:35:21,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1953876.0, ans=0.0 2023-06-28 02:37:05,867 INFO [train.py:996] (2/4) Epoch 11, batch 20750, loss[loss=0.1915, simple_loss=0.2767, pruned_loss=0.05312, over 21598.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2861, pruned_loss=0.06537, over 4250405.40 frames. ], batch size: 230, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:37:09,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1954176.0, ans=0.125 2023-06-28 02:37:46,763 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.132e+02 6.969e+02 1.049e+03 1.420e+03 3.386e+03, threshold=2.099e+03, percent-clipped=18.0 2023-06-28 02:38:48,453 INFO [train.py:996] (2/4) Epoch 11, batch 20800, loss[loss=0.1814, simple_loss=0.2522, pruned_loss=0.05533, over 21405.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2896, pruned_loss=0.06632, over 4251932.98 frames. ], batch size: 194, lr: 2.63e-03, grad_scale: 32.0 2023-06-28 02:39:03,023 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-28 02:39:11,430 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.38 vs. limit=22.5 2023-06-28 02:39:21,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1954536.0, ans=0.0 2023-06-28 02:39:24,320 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-28 02:39:39,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1954596.0, ans=0.125 2023-06-28 02:39:41,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1954656.0, ans=0.1 2023-06-28 02:40:30,173 INFO [train.py:996] (2/4) Epoch 11, batch 20850, loss[loss=0.1561, simple_loss=0.2296, pruned_loss=0.04133, over 21743.00 frames. 
], tot_loss[loss=0.2059, simple_loss=0.2827, pruned_loss=0.06456, over 4255983.41 frames. ], batch size: 247, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:41:11,761 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.093e+02 6.813e+02 9.986e+02 1.626e+03 4.926e+03, threshold=1.997e+03, percent-clipped=17.0 2023-06-28 02:41:49,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1955016.0, ans=0.125 2023-06-28 02:42:12,919 INFO [train.py:996] (2/4) Epoch 11, batch 20900, loss[loss=0.184, simple_loss=0.2706, pruned_loss=0.0487, over 21611.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2835, pruned_loss=0.06497, over 4265060.83 frames. ], batch size: 230, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:42:19,955 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 02:42:47,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1955136.0, ans=0.125 2023-06-28 02:42:54,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1955196.0, ans=0.125 2023-06-28 02:43:00,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1955196.0, ans=0.2 2023-06-28 02:43:14,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1955256.0, ans=0.2 2023-06-28 02:43:16,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1955256.0, ans=0.2 2023-06-28 02:43:17,286 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.44 vs. limit=15.0 2023-06-28 02:43:36,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1955316.0, ans=0.125 2023-06-28 02:43:46,928 INFO [train.py:996] (2/4) Epoch 11, batch 20950, loss[loss=0.1742, simple_loss=0.2466, pruned_loss=0.05086, over 21074.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2804, pruned_loss=0.06248, over 4253241.14 frames. ], batch size: 608, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:44:01,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1955376.0, ans=0.1 2023-06-28 02:44:25,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1955496.0, ans=0.2 2023-06-28 02:44:26,754 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.631e+02 6.600e+02 1.009e+03 1.481e+03 3.746e+03, threshold=2.018e+03, percent-clipped=8.0 2023-06-28 02:44:51,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1955556.0, ans=0.1 2023-06-28 02:45:05,382 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.94 vs. 
limit=15.0 2023-06-28 02:45:12,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1955616.0, ans=0.2 2023-06-28 02:45:25,836 INFO [train.py:996] (2/4) Epoch 11, batch 21000, loss[loss=0.1969, simple_loss=0.3106, pruned_loss=0.04159, over 19877.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2788, pruned_loss=0.06227, over 4263364.33 frames. ], batch size: 702, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:45:25,837 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-28 02:45:45,782 INFO [train.py:1028] (2/4) Epoch 11, validation: loss=0.2661, simple_loss=0.3574, pruned_loss=0.08743, over 1796401.00 frames. 2023-06-28 02:45:45,783 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23793MB 2023-06-28 02:46:12,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1955736.0, ans=0.0 2023-06-28 02:46:15,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1955736.0, ans=0.125 2023-06-28 02:46:23,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1955796.0, ans=0.125 2023-06-28 02:46:38,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1955856.0, ans=0.125 2023-06-28 02:46:42,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=1955856.0, ans=22.5 2023-06-28 02:46:45,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1955856.0, ans=0.0 2023-06-28 02:47:18,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1955916.0, ans=0.07 2023-06-28 02:47:22,982 INFO [train.py:996] (2/4) Epoch 11, batch 21050, loss[loss=0.2114, simple_loss=0.2818, pruned_loss=0.0705, over 21885.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2765, pruned_loss=0.06215, over 4263599.74 frames. ], batch size: 98, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:47:44,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1956036.0, ans=10.0 2023-06-28 02:47:55,734 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.65 vs. limit=15.0 2023-06-28 02:48:06,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1956096.0, ans=0.125 2023-06-28 02:48:09,003 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.407e+02 6.035e+02 7.930e+02 1.297e+03 2.545e+03, threshold=1.586e+03, percent-clipped=7.0 2023-06-28 02:49:04,829 INFO [train.py:996] (2/4) Epoch 11, batch 21100, loss[loss=0.1663, simple_loss=0.2265, pruned_loss=0.05302, over 21204.00 frames. ], tot_loss[loss=0.199, simple_loss=0.2731, pruned_loss=0.0625, over 4249377.90 frames. 
], batch size: 548, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:49:56,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1956396.0, ans=0.125 2023-06-28 02:49:59,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1956396.0, ans=0.125 2023-06-28 02:50:17,550 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.60 vs. limit=15.0 2023-06-28 02:50:40,602 INFO [train.py:996] (2/4) Epoch 11, batch 21150, loss[loss=0.2009, simple_loss=0.2584, pruned_loss=0.07169, over 21266.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2701, pruned_loss=0.06247, over 4251056.80 frames. ], batch size: 144, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:50:53,363 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.40 vs. limit=10.0 2023-06-28 02:51:25,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1956696.0, ans=15.0 2023-06-28 02:51:26,154 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.779e+02 6.313e+02 9.139e+02 1.246e+03 3.367e+03, threshold=1.828e+03, percent-clipped=14.0 2023-06-28 02:51:33,587 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=12.0 2023-06-28 02:51:49,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1956756.0, ans=0.0 2023-06-28 02:52:16,464 INFO [train.py:996] (2/4) Epoch 11, batch 21200, loss[loss=0.1708, simple_loss=0.245, pruned_loss=0.04829, over 21201.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2659, pruned_loss=0.06145, over 4257277.57 frames. ], batch size: 548, lr: 2.63e-03, grad_scale: 32.0 2023-06-28 02:52:25,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1956876.0, ans=0.0 2023-06-28 02:52:31,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1956876.0, ans=0.025 2023-06-28 02:53:15,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1957056.0, ans=0.0 2023-06-28 02:53:55,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1957116.0, ans=0.125 2023-06-28 02:53:58,175 INFO [train.py:996] (2/4) Epoch 11, batch 21250, loss[loss=0.1971, simple_loss=0.2719, pruned_loss=0.06117, over 21493.00 frames. ], tot_loss[loss=0.1944, simple_loss=0.2648, pruned_loss=0.06196, over 4253306.94 frames. 
], batch size: 212, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 02:54:47,826 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.642e+02 7.284e+02 1.070e+03 1.587e+03 2.954e+03, threshold=2.141e+03, percent-clipped=16.0 2023-06-28 02:54:48,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1957296.0, ans=0.125 2023-06-28 02:54:53,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1957296.0, ans=0.125 2023-06-28 02:55:39,439 INFO [train.py:996] (2/4) Epoch 11, batch 21300, loss[loss=0.2442, simple_loss=0.3197, pruned_loss=0.08431, over 21771.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2733, pruned_loss=0.06478, over 4261087.40 frames. ], batch size: 441, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 02:55:49,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1957476.0, ans=0.2 2023-06-28 02:56:27,308 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.17 vs. limit=22.5 2023-06-28 02:56:29,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1957596.0, ans=0.125 2023-06-28 02:56:34,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1957596.0, ans=0.125 2023-06-28 02:57:22,792 INFO [train.py:996] (2/4) Epoch 11, batch 21350, loss[loss=0.1849, simple_loss=0.2615, pruned_loss=0.0542, over 21769.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2768, pruned_loss=0.06465, over 4265668.29 frames. ], batch size: 112, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 02:58:08,237 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.369e+02 7.362e+02 1.168e+03 1.519e+03 3.106e+03, threshold=2.337e+03, percent-clipped=14.0 2023-06-28 02:58:47,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1958016.0, ans=0.1 2023-06-28 02:58:59,682 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 02:59:07,585 INFO [train.py:996] (2/4) Epoch 11, batch 21400, loss[loss=0.2411, simple_loss=0.3182, pruned_loss=0.08203, over 21296.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2799, pruned_loss=0.06373, over 4267811.24 frames. ], batch size: 159, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 02:59:36,706 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-06-28 02:59:41,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1958136.0, ans=0.125 2023-06-28 02:59:43,561 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.98 vs. 
limit=15.0 2023-06-28 03:00:38,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1958316.0, ans=0.125 2023-06-28 03:00:44,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1958316.0, ans=0.125 2023-06-28 03:00:49,143 INFO [train.py:996] (2/4) Epoch 11, batch 21450, loss[loss=0.2153, simple_loss=0.2913, pruned_loss=0.06966, over 21736.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2836, pruned_loss=0.06582, over 4272964.55 frames. ], batch size: 389, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 03:01:33,850 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.846e+02 6.247e+02 7.898e+02 1.203e+03 2.207e+03, threshold=1.580e+03, percent-clipped=0.0 2023-06-28 03:01:44,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1958496.0, ans=0.125 2023-06-28 03:02:30,237 INFO [train.py:996] (2/4) Epoch 11, batch 21500, loss[loss=0.188, simple_loss=0.2541, pruned_loss=0.06093, over 21191.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2816, pruned_loss=0.06628, over 4280189.00 frames. ], batch size: 176, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 03:02:59,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1958736.0, ans=0.025 2023-06-28 03:03:51,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1958856.0, ans=0.125 2023-06-28 03:04:08,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1958916.0, ans=0.125 2023-06-28 03:04:11,244 INFO [train.py:996] (2/4) Epoch 11, batch 21550, loss[loss=0.2099, simple_loss=0.322, pruned_loss=0.04884, over 19798.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2762, pruned_loss=0.06391, over 4270681.06 frames. ], batch size: 703, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 03:04:20,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1958976.0, ans=0.125 2023-06-28 03:04:56,025 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.241e+02 6.233e+02 9.531e+02 1.253e+03 2.671e+03, threshold=1.906e+03, percent-clipped=10.0 2023-06-28 03:05:28,566 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.91 vs. limit=10.0 2023-06-28 03:05:41,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1959216.0, ans=0.0 2023-06-28 03:05:42,525 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=15.0 2023-06-28 03:05:49,841 INFO [train.py:996] (2/4) Epoch 11, batch 21600, loss[loss=0.1875, simple_loss=0.2502, pruned_loss=0.06245, over 21591.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2719, pruned_loss=0.06287, over 4262978.06 frames. 
], batch size: 415, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:06:12,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1959336.0, ans=0.2 2023-06-28 03:06:20,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1959336.0, ans=0.05 2023-06-28 03:07:37,240 INFO [train.py:996] (2/4) Epoch 11, batch 21650, loss[loss=0.2149, simple_loss=0.3021, pruned_loss=0.06382, over 21291.00 frames. ], tot_loss[loss=0.2005, simple_loss=0.2773, pruned_loss=0.06182, over 4258060.08 frames. ], batch size: 176, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:08:26,011 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.567e+02 7.101e+02 1.132e+03 1.604e+03 3.542e+03, threshold=2.263e+03, percent-clipped=14.0 2023-06-28 03:09:10,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1959816.0, ans=0.2 2023-06-28 03:09:18,420 INFO [train.py:996] (2/4) Epoch 11, batch 21700, loss[loss=0.2054, simple_loss=0.2716, pruned_loss=0.06965, over 21748.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2782, pruned_loss=0.06066, over 4264489.73 frames. ], batch size: 371, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:09:31,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1959876.0, ans=0.125 2023-06-28 03:10:30,605 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 03:11:00,014 INFO [train.py:996] (2/4) Epoch 11, batch 21750, loss[loss=0.1998, simple_loss=0.2672, pruned_loss=0.06614, over 16334.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.2743, pruned_loss=0.06024, over 4252142.93 frames. ], batch size: 65, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:11:28,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1960236.0, ans=0.0 2023-06-28 03:11:43,944 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.642e+02 7.621e+02 1.214e+03 1.880e+03 3.851e+03, threshold=2.427e+03, percent-clipped=16.0 2023-06-28 03:11:44,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1960296.0, ans=0.125 2023-06-28 03:11:49,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1960296.0, ans=0.125 2023-06-28 03:12:04,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1960356.0, ans=0.125 2023-06-28 03:12:11,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1960356.0, ans=0.125 2023-06-28 03:12:37,219 INFO [train.py:996] (2/4) Epoch 11, batch 21800, loss[loss=0.1894, simple_loss=0.2587, pruned_loss=0.0601, over 21791.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2723, pruned_loss=0.06093, over 4248598.65 frames. 
], batch size: 118, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:12:38,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1960476.0, ans=0.0 2023-06-28 03:13:00,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1960536.0, ans=0.125 2023-06-28 03:13:24,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1960596.0, ans=0.125 2023-06-28 03:13:41,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1960656.0, ans=0.125 2023-06-28 03:14:15,423 INFO [train.py:996] (2/4) Epoch 11, batch 21850, loss[loss=0.2036, simple_loss=0.3259, pruned_loss=0.0407, over 20802.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2777, pruned_loss=0.06143, over 4248996.62 frames. ], batch size: 607, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:14:16,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1960776.0, ans=0.0 2023-06-28 03:14:22,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1960776.0, ans=0.0 2023-06-28 03:14:36,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1960836.0, ans=0.1 2023-06-28 03:14:49,055 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.23 vs. limit=8.0 2023-06-28 03:14:56,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1960896.0, ans=0.125 2023-06-28 03:15:00,558 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.488e+02 6.380e+02 8.991e+02 1.412e+03 2.394e+03, threshold=1.798e+03, percent-clipped=0.0 2023-06-28 03:15:13,107 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.54 vs. limit=15.0 2023-06-28 03:15:20,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1960956.0, ans=0.125 2023-06-28 03:15:23,025 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-06-28 03:15:52,999 INFO [train.py:996] (2/4) Epoch 11, batch 21900, loss[loss=0.2245, simple_loss=0.2967, pruned_loss=0.07616, over 21815.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2791, pruned_loss=0.06228, over 4256905.20 frames. 
], batch size: 441, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:16:45,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1961196.0, ans=0.125 2023-06-28 03:16:59,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1961256.0, ans=0.125 2023-06-28 03:17:07,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1961316.0, ans=0.125 2023-06-28 03:17:09,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1961316.0, ans=0.0 2023-06-28 03:17:28,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1961376.0, ans=0.0 2023-06-28 03:17:29,754 INFO [train.py:996] (2/4) Epoch 11, batch 21950, loss[loss=0.1701, simple_loss=0.2536, pruned_loss=0.04331, over 21197.00 frames. ], tot_loss[loss=0.1984, simple_loss=0.2735, pruned_loss=0.06162, over 4261938.45 frames. ], batch size: 548, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:17:38,965 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.39 vs. limit=22.5 2023-06-28 03:17:54,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1961436.0, ans=0.0 2023-06-28 03:18:11,448 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.07 vs. limit=15.0 2023-06-28 03:18:16,081 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-06-28 03:18:23,016 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.683e+02 5.864e+02 6.968e+02 1.003e+03 1.764e+03, threshold=1.394e+03, percent-clipped=0.0 2023-06-28 03:18:35,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1961556.0, ans=0.025 2023-06-28 03:18:37,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1961556.0, ans=0.0 2023-06-28 03:18:40,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1961556.0, ans=0.125 2023-06-28 03:18:54,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1961616.0, ans=0.0 2023-06-28 03:18:57,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1961616.0, ans=0.0 2023-06-28 03:19:11,928 INFO [train.py:996] (2/4) Epoch 11, batch 22000, loss[loss=0.2126, simple_loss=0.2724, pruned_loss=0.07636, over 21354.00 frames. ], tot_loss[loss=0.1943, simple_loss=0.269, pruned_loss=0.05982, over 4264785.85 frames. ], batch size: 473, lr: 2.62e-03, grad_scale: 32.0 2023-06-28 03:19:30,201 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.44 vs. 
limit=15.0 2023-06-28 03:19:59,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1961796.0, ans=0.2 2023-06-28 03:20:32,001 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.31 vs. limit=15.0 2023-06-28 03:20:55,917 INFO [train.py:996] (2/4) Epoch 11, batch 22050, loss[loss=0.223, simple_loss=0.3092, pruned_loss=0.06846, over 21456.00 frames. ], tot_loss[loss=0.1967, simple_loss=0.2728, pruned_loss=0.06024, over 4252210.15 frames. ], batch size: 211, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:21:46,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1962096.0, ans=0.1 2023-06-28 03:21:53,077 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.146e+02 7.401e+02 1.317e+03 1.911e+03 4.599e+03, threshold=2.634e+03, percent-clipped=46.0 2023-06-28 03:21:59,588 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-28 03:22:04,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1962156.0, ans=0.0 2023-06-28 03:22:40,211 INFO [train.py:996] (2/4) Epoch 11, batch 22100, loss[loss=0.2367, simple_loss=0.3071, pruned_loss=0.08313, over 21367.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2847, pruned_loss=0.06588, over 4263468.15 frames. ], batch size: 144, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:22:42,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1962276.0, ans=0.0 2023-06-28 03:22:45,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1962276.0, ans=0.125 2023-06-28 03:23:40,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1962456.0, ans=0.0 2023-06-28 03:24:17,381 INFO [train.py:996] (2/4) Epoch 11, batch 22150, loss[loss=0.2154, simple_loss=0.2846, pruned_loss=0.07316, over 21523.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2877, pruned_loss=0.06752, over 4270895.26 frames. ], batch size: 548, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:24:53,063 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.12 vs. limit=15.0 2023-06-28 03:25:13,700 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.995e+02 8.783e+02 1.255e+03 1.849e+03 4.260e+03, threshold=2.511e+03, percent-clipped=3.0 2023-06-28 03:25:45,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1962816.0, ans=0.125 2023-06-28 03:26:00,157 INFO [train.py:996] (2/4) Epoch 11, batch 22200, loss[loss=0.1901, simple_loss=0.2589, pruned_loss=0.06061, over 21694.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2894, pruned_loss=0.06814, over 4280272.41 frames. 
], batch size: 263, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:26:00,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1962876.0, ans=0.04949747468305833 2023-06-28 03:26:19,593 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.89 vs. limit=6.0 2023-06-28 03:27:22,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1963056.0, ans=0.125 2023-06-28 03:27:31,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1963116.0, ans=0.125 2023-06-28 03:27:31,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1963116.0, ans=0.125 2023-06-28 03:27:37,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1963116.0, ans=0.125 2023-06-28 03:27:42,231 INFO [train.py:996] (2/4) Epoch 11, batch 22250, loss[loss=0.2264, simple_loss=0.3287, pruned_loss=0.06204, over 19864.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2935, pruned_loss=0.06877, over 4285282.11 frames. ], batch size: 703, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:28:07,107 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 03:28:20,552 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-28 03:28:37,933 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.198e+02 6.711e+02 8.468e+02 1.239e+03 3.194e+03, threshold=1.694e+03, percent-clipped=5.0 2023-06-28 03:28:45,616 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.96 vs. limit=15.0 2023-06-28 03:28:46,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1963356.0, ans=0.0 2023-06-28 03:29:14,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1963416.0, ans=0.0 2023-06-28 03:29:28,291 INFO [train.py:996] (2/4) Epoch 11, batch 22300, loss[loss=0.243, simple_loss=0.3122, pruned_loss=0.08687, over 21717.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2948, pruned_loss=0.07024, over 4289107.13 frames. 
], batch size: 414, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:29:43,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1963476.0, ans=0.0 2023-06-28 03:30:04,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1963536.0, ans=0.125 2023-06-28 03:30:25,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1963656.0, ans=0.125 2023-06-28 03:30:29,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1963656.0, ans=0.2 2023-06-28 03:31:14,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1963776.0, ans=15.0 2023-06-28 03:31:14,594 INFO [train.py:996] (2/4) Epoch 11, batch 22350, loss[loss=0.2565, simple_loss=0.3034, pruned_loss=0.1047, over 21803.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2932, pruned_loss=0.07105, over 4289684.51 frames. ], batch size: 508, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:31:15,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1963776.0, ans=0.5 2023-06-28 03:31:56,963 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 03:32:01,626 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.484e+02 6.278e+02 9.923e+02 1.351e+03 2.767e+03, threshold=1.985e+03, percent-clipped=14.0 2023-06-28 03:32:14,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1963956.0, ans=0.125 2023-06-28 03:32:21,256 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.41 vs. limit=22.5 2023-06-28 03:32:58,061 INFO [train.py:996] (2/4) Epoch 11, batch 22400, loss[loss=0.2087, simple_loss=0.2752, pruned_loss=0.0711, over 21476.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2903, pruned_loss=0.06859, over 4286610.61 frames. ], batch size: 194, lr: 2.62e-03, grad_scale: 32.0 2023-06-28 03:32:58,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1964076.0, ans=0.0 2023-06-28 03:34:08,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1964256.0, ans=0.125 2023-06-28 03:34:10,812 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-28 03:34:40,477 INFO [train.py:996] (2/4) Epoch 11, batch 22450, loss[loss=0.1895, simple_loss=0.2517, pruned_loss=0.06361, over 21586.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2841, pruned_loss=0.06758, over 4288647.34 frames. 
], batch size: 231, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:34:57,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1964376.0, ans=0.125 2023-06-28 03:35:01,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1964436.0, ans=0.2 2023-06-28 03:35:35,825 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.371e+02 5.963e+02 8.267e+02 1.246e+03 2.225e+03, threshold=1.653e+03, percent-clipped=2.0 2023-06-28 03:35:48,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1964556.0, ans=0.0 2023-06-28 03:36:14,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1964616.0, ans=0.0 2023-06-28 03:36:24,031 INFO [train.py:996] (2/4) Epoch 11, batch 22500, loss[loss=0.2559, simple_loss=0.3584, pruned_loss=0.07672, over 21631.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2793, pruned_loss=0.06626, over 4272660.50 frames. ], batch size: 414, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:36:53,521 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.83 vs. limit=22.5 2023-06-28 03:37:06,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1964796.0, ans=0.125 2023-06-28 03:38:07,199 INFO [train.py:996] (2/4) Epoch 11, batch 22550, loss[loss=0.1855, simple_loss=0.2645, pruned_loss=0.05319, over 21843.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2838, pruned_loss=0.06699, over 4277069.81 frames. ], batch size: 247, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:38:08,452 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-28 03:38:50,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1965096.0, ans=0.2 2023-06-28 03:39:03,147 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-28 03:39:03,638 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.783e+02 6.887e+02 1.011e+03 1.935e+03 4.167e+03, threshold=2.022e+03, percent-clipped=31.0 2023-06-28 03:39:56,280 INFO [train.py:996] (2/4) Epoch 11, batch 22600, loss[loss=0.1719, simple_loss=0.2459, pruned_loss=0.04889, over 21779.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2879, pruned_loss=0.06741, over 4282650.02 frames. 
], batch size: 112, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:40:08,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1965276.0, ans=0.125 2023-06-28 03:40:33,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1965396.0, ans=0.125 2023-06-28 03:41:30,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1965516.0, ans=0.1 2023-06-28 03:41:32,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1965576.0, ans=0.09899494936611666 2023-06-28 03:41:33,172 INFO [train.py:996] (2/4) Epoch 11, batch 22650, loss[loss=0.2102, simple_loss=0.2748, pruned_loss=0.0728, over 21869.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2839, pruned_loss=0.06676, over 4279171.28 frames. ], batch size: 373, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:41:59,600 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 03:42:06,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1965636.0, ans=0.1 2023-06-28 03:42:10,218 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.47 vs. limit=15.0 2023-06-28 03:42:26,674 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.890e+02 8.413e+02 1.340e+03 1.745e+03 3.098e+03, threshold=2.679e+03, percent-clipped=14.0 2023-06-28 03:42:30,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1965756.0, ans=0.125 2023-06-28 03:42:48,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1965756.0, ans=0.0 2023-06-28 03:42:54,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1965816.0, ans=0.035 2023-06-28 03:43:14,255 INFO [train.py:996] (2/4) Epoch 11, batch 22700, loss[loss=0.1706, simple_loss=0.2431, pruned_loss=0.04902, over 21742.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2773, pruned_loss=0.06568, over 4280038.75 frames. ], batch size: 124, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:43:43,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1965936.0, ans=0.125 2023-06-28 03:43:49,019 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.97 vs. 
limit=22.5 2023-06-28 03:44:04,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1965996.0, ans=0.0 2023-06-28 03:44:06,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1965996.0, ans=0.0 2023-06-28 03:44:12,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1965996.0, ans=0.2 2023-06-28 03:44:14,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1965996.0, ans=0.125 2023-06-28 03:44:26,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1966056.0, ans=0.0 2023-06-28 03:44:47,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1966116.0, ans=0.0 2023-06-28 03:44:56,784 INFO [train.py:996] (2/4) Epoch 11, batch 22750, loss[loss=0.2359, simple_loss=0.3111, pruned_loss=0.08031, over 21426.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2813, pruned_loss=0.06809, over 4274896.64 frames. ], batch size: 211, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:45:55,455 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.898e+02 9.181e+02 1.363e+03 2.029e+03 5.534e+03, threshold=2.727e+03, percent-clipped=14.0 2023-06-28 03:45:59,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1966356.0, ans=0.125 2023-06-28 03:46:10,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1966356.0, ans=0.125 2023-06-28 03:46:17,342 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=12.0 2023-06-28 03:46:38,630 INFO [train.py:996] (2/4) Epoch 11, batch 22800, loss[loss=0.2548, simple_loss=0.3093, pruned_loss=0.1002, over 21728.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2864, pruned_loss=0.07085, over 4280333.22 frames. ], batch size: 508, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:46:44,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1966476.0, ans=0.2 2023-06-28 03:46:51,432 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.61 vs. limit=15.0 2023-06-28 03:47:07,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1966536.0, ans=0.1 2023-06-28 03:48:20,560 INFO [train.py:996] (2/4) Epoch 11, batch 22850, loss[loss=0.2074, simple_loss=0.27, pruned_loss=0.07245, over 21624.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.282, pruned_loss=0.06944, over 4277485.35 frames. 
], batch size: 247, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:48:42,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1966836.0, ans=0.5 2023-06-28 03:49:19,986 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.398e+02 6.821e+02 8.997e+02 1.443e+03 3.960e+03, threshold=1.799e+03, percent-clipped=4.0 2023-06-28 03:49:29,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1966956.0, ans=0.2 2023-06-28 03:49:29,794 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.60 vs. limit=15.0 2023-06-28 03:50:04,231 INFO [train.py:996] (2/4) Epoch 11, batch 22900, loss[loss=0.2269, simple_loss=0.3156, pruned_loss=0.06914, over 21249.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2839, pruned_loss=0.06856, over 4278087.56 frames. ], batch size: 548, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:51:25,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1967256.0, ans=0.125 2023-06-28 03:51:30,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1967316.0, ans=0.05 2023-06-28 03:51:45,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1967316.0, ans=0.125 2023-06-28 03:51:48,483 INFO [train.py:996] (2/4) Epoch 11, batch 22950, loss[loss=0.2151, simple_loss=0.3269, pruned_loss=0.05165, over 21404.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2959, pruned_loss=0.06762, over 4279453.31 frames. ], batch size: 211, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:52:05,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1967376.0, ans=0.125 2023-06-28 03:52:16,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1967436.0, ans=0.0 2023-06-28 03:52:42,079 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.491e+02 7.317e+02 1.405e+03 2.219e+03 4.116e+03, threshold=2.810e+03, percent-clipped=42.0 2023-06-28 03:52:55,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1967556.0, ans=0.125 2023-06-28 03:53:25,451 INFO [train.py:996] (2/4) Epoch 11, batch 23000, loss[loss=0.2213, simple_loss=0.2964, pruned_loss=0.07315, over 21857.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2959, pruned_loss=0.06614, over 4283506.81 frames. 
], batch size: 371, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:54:09,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1967736.0, ans=0.07 2023-06-28 03:54:12,871 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 03:54:26,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1967796.0, ans=0.125 2023-06-28 03:54:46,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1967856.0, ans=0.0 2023-06-28 03:54:56,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1967916.0, ans=0.125 2023-06-28 03:54:57,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1967916.0, ans=10.0 2023-06-28 03:55:11,924 INFO [train.py:996] (2/4) Epoch 11, batch 23050, loss[loss=0.2272, simple_loss=0.3061, pruned_loss=0.07409, over 21449.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2967, pruned_loss=0.06725, over 4277052.44 frames. ], batch size: 131, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:55:20,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1967976.0, ans=0.125 2023-06-28 03:55:31,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1967976.0, ans=0.2 2023-06-28 03:56:02,461 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.806e+02 7.903e+02 1.210e+03 1.646e+03 4.576e+03, threshold=2.420e+03, percent-clipped=5.0 2023-06-28 03:56:12,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1968156.0, ans=0.035 2023-06-28 03:56:12,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1968156.0, ans=0.1 2023-06-28 03:56:26,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1968156.0, ans=0.0 2023-06-28 03:56:52,759 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.03 vs. limit=15.0 2023-06-28 03:56:53,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1968276.0, ans=0.0 2023-06-28 03:56:54,616 INFO [train.py:996] (2/4) Epoch 11, batch 23100, loss[loss=0.2079, simple_loss=0.2769, pruned_loss=0.06945, over 21812.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2925, pruned_loss=0.06808, over 4273497.54 frames. ], batch size: 118, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:58:36,209 INFO [train.py:996] (2/4) Epoch 11, batch 23150, loss[loss=0.2162, simple_loss=0.2906, pruned_loss=0.0709, over 20709.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2877, pruned_loss=0.06735, over 4273714.43 frames. ], batch size: 607, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:58:42,366 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.77 vs. 
limit=15.0 2023-06-28 03:58:53,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1968636.0, ans=0.125 2023-06-28 03:58:59,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1968636.0, ans=0.0 2023-06-28 03:59:00,291 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.84 vs. limit=10.0 2023-06-28 03:59:03,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1968636.0, ans=0.0 2023-06-28 03:59:18,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1968696.0, ans=0.125 2023-06-28 03:59:20,957 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.155e+02 6.572e+02 9.609e+02 1.447e+03 3.666e+03, threshold=1.922e+03, percent-clipped=4.0 2023-06-28 03:59:48,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1968816.0, ans=0.125 2023-06-28 04:00:06,818 INFO [train.py:996] (2/4) Epoch 11, batch 23200, loss[loss=0.1958, simple_loss=0.2709, pruned_loss=0.06031, over 21375.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2874, pruned_loss=0.06788, over 4274954.12 frames. ], batch size: 143, lr: 2.62e-03, grad_scale: 32.0 2023-06-28 04:00:07,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1968876.0, ans=0.2 2023-06-28 04:01:43,117 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 04:01:48,931 INFO [train.py:996] (2/4) Epoch 11, batch 23250, loss[loss=0.2135, simple_loss=0.2895, pruned_loss=0.06873, over 19943.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2867, pruned_loss=0.06862, over 4278118.49 frames. ], batch size: 702, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:01:56,953 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.40 vs. limit=15.0 2023-06-28 04:02:14,242 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.06 vs. limit=15.0 2023-06-28 04:02:24,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1969296.0, ans=0.2 2023-06-28 04:02:42,554 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.956e+02 7.376e+02 1.130e+03 1.714e+03 3.374e+03, threshold=2.260e+03, percent-clipped=21.0 2023-06-28 04:02:46,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1969356.0, ans=0.0 2023-06-28 04:02:51,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1969356.0, ans=0.125 2023-06-28 04:03:02,506 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.99 vs. 
limit=15.0 2023-06-28 04:03:08,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1969416.0, ans=0.0 2023-06-28 04:03:18,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1969416.0, ans=0.0 2023-06-28 04:03:20,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1969416.0, ans=0.125 2023-06-28 04:03:34,413 INFO [train.py:996] (2/4) Epoch 11, batch 23300, loss[loss=0.1897, simple_loss=0.2557, pruned_loss=0.06183, over 21187.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2936, pruned_loss=0.0706, over 4281654.56 frames. ], batch size: 608, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:03:41,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1969476.0, ans=0.0 2023-06-28 04:04:02,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1969536.0, ans=0.1 2023-06-28 04:04:42,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1969656.0, ans=0.95 2023-06-28 04:05:04,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1969716.0, ans=0.07 2023-06-28 04:05:07,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1969716.0, ans=0.2 2023-06-28 04:05:18,306 INFO [train.py:996] (2/4) Epoch 11, batch 23350, loss[loss=0.1572, simple_loss=0.2267, pruned_loss=0.04386, over 21899.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2975, pruned_loss=0.07017, over 4285548.02 frames. ], batch size: 107, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:06:14,291 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.812e+02 6.998e+02 1.084e+03 1.696e+03 4.677e+03, threshold=2.169e+03, percent-clipped=9.0 2023-06-28 04:06:48,152 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.16 vs. limit=22.5 2023-06-28 04:07:00,209 INFO [train.py:996] (2/4) Epoch 11, batch 23400, loss[loss=0.2061, simple_loss=0.2813, pruned_loss=0.06542, over 21506.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2911, pruned_loss=0.06665, over 4286772.45 frames. ], batch size: 131, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:07:04,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1970076.0, ans=0.125 2023-06-28 04:07:14,963 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.81 vs. limit=10.0 2023-06-28 04:07:34,735 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.83 vs. 
limit=15.0 2023-06-28 04:07:45,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1970196.0, ans=0.05 2023-06-28 04:07:45,588 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 04:07:55,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1970196.0, ans=0.0 2023-06-28 04:07:55,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1970196.0, ans=0.125 2023-06-28 04:08:23,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1970256.0, ans=0.1 2023-06-28 04:08:42,821 INFO [train.py:996] (2/4) Epoch 11, batch 23450, loss[loss=0.2538, simple_loss=0.3158, pruned_loss=0.09588, over 21831.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2918, pruned_loss=0.06766, over 4285879.66 frames. ], batch size: 441, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:08:45,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1970376.0, ans=0.0 2023-06-28 04:09:02,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1970376.0, ans=0.125 2023-06-28 04:09:08,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1970436.0, ans=0.125 2023-06-28 04:09:31,804 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.87 vs. limit=12.0 2023-06-28 04:09:34,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1970496.0, ans=0.1 2023-06-28 04:09:39,102 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.169e+02 9.066e+02 1.305e+03 2.110e+03 3.921e+03, threshold=2.611e+03, percent-clipped=24.0 2023-06-28 04:09:58,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1970556.0, ans=0.125 2023-06-28 04:10:04,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1970616.0, ans=0.125 2023-06-28 04:10:20,270 INFO [train.py:996] (2/4) Epoch 11, batch 23500, loss[loss=0.2032, simple_loss=0.3, pruned_loss=0.0532, over 19990.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2938, pruned_loss=0.06933, over 4287642.14 frames. ], batch size: 702, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:10:44,121 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.63 vs. 
limit=22.5 2023-06-28 04:10:47,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1970736.0, ans=0.5 2023-06-28 04:11:07,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1970796.0, ans=0.0 2023-06-28 04:11:50,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1970916.0, ans=0.0 2023-06-28 04:11:56,889 INFO [train.py:996] (2/4) Epoch 11, batch 23550, loss[loss=0.2198, simple_loss=0.2597, pruned_loss=0.0899, over 21411.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2882, pruned_loss=0.06952, over 4286016.24 frames. ], batch size: 508, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:12:04,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1970976.0, ans=0.5 2023-06-28 04:12:16,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1971036.0, ans=0.025 2023-06-28 04:12:42,095 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.87 vs. limit=6.0 2023-06-28 04:12:56,969 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.515e+02 7.071e+02 9.804e+02 1.415e+03 2.782e+03, threshold=1.961e+03, percent-clipped=2.0 2023-06-28 04:13:04,914 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-28 04:13:24,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1971216.0, ans=0.125 2023-06-28 04:13:33,843 INFO [train.py:996] (2/4) Epoch 11, batch 23600, loss[loss=0.2305, simple_loss=0.3154, pruned_loss=0.07283, over 21579.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2885, pruned_loss=0.06956, over 4287918.77 frames. ], batch size: 414, lr: 2.62e-03, grad_scale: 32.0 2023-06-28 04:13:43,831 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.97 vs. limit=5.0 2023-06-28 04:13:46,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1971276.0, ans=0.025 2023-06-28 04:13:51,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1971276.0, ans=0.0 2023-06-28 04:14:06,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1971336.0, ans=0.2 2023-06-28 04:14:43,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1971456.0, ans=0.04949747468305833 2023-06-28 04:14:57,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1971456.0, ans=0.1 2023-06-28 04:14:59,731 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-28 04:15:22,098 INFO [train.py:996] (2/4) Epoch 11, batch 23650, loss[loss=0.2283, simple_loss=0.3135, pruned_loss=0.07149, over 21488.00 frames. 
], tot_loss[loss=0.2138, simple_loss=0.2896, pruned_loss=0.06904, over 4288968.43 frames. ], batch size: 471, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:16:01,460 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 04:16:08,529 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.55 vs. limit=15.0 2023-06-28 04:16:25,960 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.038e+02 7.663e+02 1.286e+03 2.404e+03 4.690e+03, threshold=2.571e+03, percent-clipped=33.0 2023-06-28 04:17:08,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1971816.0, ans=0.04949747468305833 2023-06-28 04:17:11,313 INFO [train.py:996] (2/4) Epoch 11, batch 23700, loss[loss=0.2018, simple_loss=0.2813, pruned_loss=0.06121, over 21925.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2906, pruned_loss=0.06784, over 4288019.09 frames. ], batch size: 317, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:17:12,569 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.47 vs. limit=15.0 2023-06-28 04:17:57,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1971996.0, ans=0.125 2023-06-28 04:17:59,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1971996.0, ans=0.07 2023-06-28 04:18:27,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1972056.0, ans=0.2 2023-06-28 04:18:31,707 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.40 vs. limit=15.0 2023-06-28 04:18:47,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1972116.0, ans=0.125 2023-06-28 04:18:55,617 INFO [train.py:996] (2/4) Epoch 11, batch 23750, loss[loss=0.1716, simple_loss=0.2714, pruned_loss=0.03596, over 21951.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2955, pruned_loss=0.06851, over 4278133.26 frames. ], batch size: 317, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:19:27,829 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-28 04:19:59,571 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.432e+02 7.586e+02 1.231e+03 1.988e+03 4.114e+03, threshold=2.463e+03, percent-clipped=17.0 2023-06-28 04:20:49,619 INFO [train.py:996] (2/4) Epoch 11, batch 23800, loss[loss=0.1841, simple_loss=0.2641, pruned_loss=0.05208, over 21761.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2937, pruned_loss=0.06695, over 4254925.82 frames. 
], batch size: 124, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:20:53,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1972476.0, ans=0.2 2023-06-28 04:21:29,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1972596.0, ans=0.0 2023-06-28 04:21:40,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1972596.0, ans=0.125 2023-06-28 04:22:20,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1972716.0, ans=0.0 2023-06-28 04:22:30,803 INFO [train.py:996] (2/4) Epoch 11, batch 23850, loss[loss=0.2455, simple_loss=0.3164, pruned_loss=0.08729, over 21221.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.3002, pruned_loss=0.06932, over 4255454.34 frames. ], batch size: 143, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:22:38,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1972776.0, ans=0.1 2023-06-28 04:23:07,579 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 04:23:19,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1972896.0, ans=0.0 2023-06-28 04:23:30,393 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.838e+02 1.015e+03 1.727e+03 2.965e+03 4.931e+03, threshold=3.454e+03, percent-clipped=27.0 2023-06-28 04:23:32,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1972956.0, ans=0.1 2023-06-28 04:24:00,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1973016.0, ans=0.035 2023-06-28 04:24:14,931 INFO [train.py:996] (2/4) Epoch 11, batch 23900, loss[loss=0.2077, simple_loss=0.2785, pruned_loss=0.06848, over 21153.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3061, pruned_loss=0.07164, over 4255677.34 frames. ], batch size: 159, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:24:27,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1973076.0, ans=0.1 2023-06-28 04:24:35,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1973136.0, ans=0.0 2023-06-28 04:24:36,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1973136.0, ans=0.025 2023-06-28 04:24:48,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1973136.0, ans=0.125 2023-06-28 04:25:25,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1973256.0, ans=0.125 2023-06-28 04:25:36,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1973256.0, ans=0.125 2023-06-28 04:25:57,206 INFO [train.py:996] (2/4) Epoch 11, batch 23950, loss[loss=0.2048, simple_loss=0.2719, pruned_loss=0.06888, over 21188.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3005, pruned_loss=0.07046, over 4259287.24 frames. 
], batch size: 143, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:26:03,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1973376.0, ans=0.125 2023-06-28 04:26:05,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1973376.0, ans=0.05 2023-06-28 04:26:05,704 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.65 vs. limit=22.5 2023-06-28 04:26:06,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1973376.0, ans=0.0 2023-06-28 04:26:26,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1973436.0, ans=0.1 2023-06-28 04:27:01,134 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.639e+02 7.884e+02 1.240e+03 1.758e+03 3.648e+03, threshold=2.481e+03, percent-clipped=1.0 2023-06-28 04:27:18,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1973556.0, ans=0.125 2023-06-28 04:27:39,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1973676.0, ans=0.1 2023-06-28 04:27:40,596 INFO [train.py:996] (2/4) Epoch 11, batch 24000, loss[loss=0.2334, simple_loss=0.3037, pruned_loss=0.08157, over 21460.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3024, pruned_loss=0.07316, over 4267660.58 frames. ], batch size: 211, lr: 2.62e-03, grad_scale: 32.0 2023-06-28 04:27:40,596 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-28 04:27:55,992 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([5.1331, 2.5946, 4.4720, 2.4702], device='cuda:2') 2023-06-28 04:28:01,240 INFO [train.py:1028] (2/4) Epoch 11, validation: loss=0.2606, simple_loss=0.3539, pruned_loss=0.08365, over 1796401.00 frames. 2023-06-28 04:28:01,241 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23793MB 2023-06-28 04:29:31,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1973916.0, ans=0.125 2023-06-28 04:29:32,238 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.74 vs. limit=15.0 2023-06-28 04:29:45,822 INFO [train.py:996] (2/4) Epoch 11, batch 24050, loss[loss=0.1853, simple_loss=0.2776, pruned_loss=0.04652, over 21607.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.303, pruned_loss=0.07275, over 4265796.42 frames. 
], batch size: 230, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:30:11,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1974036.0, ans=0.0 2023-06-28 04:30:16,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1974036.0, ans=0.2 2023-06-28 04:30:50,228 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.159e+02 7.180e+02 1.052e+03 1.636e+03 2.739e+03, threshold=2.104e+03, percent-clipped=1.0 2023-06-28 04:30:53,432 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. limit=6.0 2023-06-28 04:31:11,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1974216.0, ans=0.125 2023-06-28 04:31:22,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1974216.0, ans=0.07 2023-06-28 04:31:33,818 INFO [train.py:996] (2/4) Epoch 11, batch 24100, loss[loss=0.2095, simple_loss=0.3034, pruned_loss=0.0578, over 21549.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3016, pruned_loss=0.0713, over 4264206.32 frames. ], batch size: 230, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:32:20,196 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=15.0 2023-06-28 04:32:42,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1974456.0, ans=0.1 2023-06-28 04:33:14,909 INFO [train.py:996] (2/4) Epoch 11, batch 24150, loss[loss=0.1734, simple_loss=0.2148, pruned_loss=0.06602, over 20026.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3021, pruned_loss=0.07299, over 4265237.30 frames. ], batch size: 703, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:33:43,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1974636.0, ans=0.125 2023-06-28 04:34:14,498 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.271e+02 8.001e+02 1.203e+03 1.842e+03 3.600e+03, threshold=2.405e+03, percent-clipped=13.0 2023-06-28 04:34:18,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1974756.0, ans=0.1 2023-06-28 04:34:57,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1974876.0, ans=0.2 2023-06-28 04:34:58,327 INFO [train.py:996] (2/4) Epoch 11, batch 24200, loss[loss=0.1931, simple_loss=0.2885, pruned_loss=0.04889, over 21704.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3033, pruned_loss=0.07374, over 4270341.79 frames. ], batch size: 247, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:35:32,934 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.97 vs. limit=22.5 2023-06-28 04:36:36,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1975116.0, ans=0.0 2023-06-28 04:36:47,568 INFO [train.py:996] (2/4) Epoch 11, batch 24250, loss[loss=0.1997, simple_loss=0.2976, pruned_loss=0.05094, over 21793.00 frames. 
], tot_loss[loss=0.2198, simple_loss=0.301, pruned_loss=0.06924, over 4268491.73 frames. ], batch size: 282, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:37:26,032 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.88 vs. limit=22.5 2023-06-28 04:37:48,143 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.666e+02 6.261e+02 9.348e+02 1.527e+03 2.867e+03, threshold=1.870e+03, percent-clipped=6.0 2023-06-28 04:38:28,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1975416.0, ans=0.125 2023-06-28 04:38:35,071 INFO [train.py:996] (2/4) Epoch 11, batch 24300, loss[loss=0.1668, simple_loss=0.2583, pruned_loss=0.03771, over 21627.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2955, pruned_loss=0.0639, over 4267102.42 frames. ], batch size: 230, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:38:47,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1975476.0, ans=0.0 2023-06-28 04:39:06,384 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0 2023-06-28 04:39:35,967 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=15.0 2023-06-28 04:40:15,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1975776.0, ans=0.125 2023-06-28 04:40:16,735 INFO [train.py:996] (2/4) Epoch 11, batch 24350, loss[loss=0.2307, simple_loss=0.3023, pruned_loss=0.07958, over 21241.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2925, pruned_loss=0.06325, over 4273325.57 frames. ], batch size: 143, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:40:36,555 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=15.0 2023-06-28 04:40:54,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1975896.0, ans=0.125 2023-06-28 04:41:03,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1975896.0, ans=0.125 2023-06-28 04:41:16,491 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.173e+02 7.216e+02 1.198e+03 1.667e+03 3.137e+03, threshold=2.397e+03, percent-clipped=16.0 2023-06-28 04:41:23,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1975956.0, ans=0.125 2023-06-28 04:41:53,445 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 04:41:59,537 INFO [train.py:996] (2/4) Epoch 11, batch 24400, loss[loss=0.2788, simple_loss=0.3448, pruned_loss=0.1065, over 21460.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2964, pruned_loss=0.06705, over 4280557.25 frames. 
], batch size: 509, lr: 2.62e-03, grad_scale: 32.0 2023-06-28 04:43:13,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1976256.0, ans=0.09899494936611666 2023-06-28 04:43:42,614 INFO [train.py:996] (2/4) Epoch 11, batch 24450, loss[loss=0.2506, simple_loss=0.3487, pruned_loss=0.07623, over 21682.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2984, pruned_loss=0.06809, over 4282595.00 frames. ], batch size: 298, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:43:43,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1976376.0, ans=0.0 2023-06-28 04:44:02,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1976436.0, ans=0.0 2023-06-28 04:44:48,222 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.546e+02 6.657e+02 8.727e+02 1.270e+03 2.887e+03, threshold=1.745e+03, percent-clipped=2.0 2023-06-28 04:44:48,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1976556.0, ans=0.0 2023-06-28 04:44:49,395 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=22.5 2023-06-28 04:45:24,295 INFO [train.py:996] (2/4) Epoch 11, batch 24500, loss[loss=0.195, simple_loss=0.2819, pruned_loss=0.05409, over 21162.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2968, pruned_loss=0.06773, over 4286311.10 frames. ], batch size: 143, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:46:55,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1976916.0, ans=0.1 2023-06-28 04:47:07,087 INFO [train.py:996] (2/4) Epoch 11, batch 24550, loss[loss=0.2668, simple_loss=0.3509, pruned_loss=0.09134, over 21835.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.3, pruned_loss=0.0701, over 4283573.84 frames. ], batch size: 124, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:47:27,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1977036.0, ans=0.125 2023-06-28 04:48:18,394 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.753e+02 7.977e+02 1.391e+03 1.923e+03 3.873e+03, threshold=2.782e+03, percent-clipped=31.0 2023-06-28 04:48:20,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1977156.0, ans=0.125 2023-06-28 04:48:27,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1977156.0, ans=0.125 2023-06-28 04:48:54,457 INFO [train.py:996] (2/4) Epoch 11, batch 24600, loss[loss=0.1809, simple_loss=0.2516, pruned_loss=0.05514, over 21269.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2974, pruned_loss=0.07063, over 4280411.46 frames. 
], batch size: 176, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:49:00,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1977276.0, ans=0.1 2023-06-28 04:49:06,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1977276.0, ans=0.1 2023-06-28 04:49:26,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1977336.0, ans=0.0 2023-06-28 04:49:49,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1977396.0, ans=0.2 2023-06-28 04:50:09,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1977456.0, ans=0.125 2023-06-28 04:50:37,066 INFO [train.py:996] (2/4) Epoch 11, batch 24650, loss[loss=0.1905, simple_loss=0.2528, pruned_loss=0.06407, over 21095.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2899, pruned_loss=0.06888, over 4281134.24 frames. ], batch size: 143, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:51:04,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1977636.0, ans=0.1 2023-06-28 04:51:07,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1977636.0, ans=0.125 2023-06-28 04:51:14,708 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-28 04:51:17,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1977636.0, ans=0.125 2023-06-28 04:51:42,515 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.777e+02 8.360e+02 1.097e+03 1.550e+03 2.969e+03, threshold=2.194e+03, percent-clipped=2.0 2023-06-28 04:52:19,288 INFO [train.py:996] (2/4) Epoch 11, batch 24700, loss[loss=0.2401, simple_loss=0.2941, pruned_loss=0.09312, over 21437.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.287, pruned_loss=0.0681, over 4267780.50 frames. ], batch size: 509, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:53:16,503 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.50 vs. limit=15.0 2023-06-28 04:54:01,977 INFO [train.py:996] (2/4) Epoch 11, batch 24750, loss[loss=0.1926, simple_loss=0.2567, pruned_loss=0.06423, over 21371.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2815, pruned_loss=0.06618, over 4262573.06 frames. ], batch size: 160, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:54:25,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1978236.0, ans=0.0 2023-06-28 04:54:29,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1978236.0, ans=0.125 2023-06-28 04:55:07,374 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.038e+02 5.891e+02 8.003e+02 1.099e+03 2.127e+03, threshold=1.601e+03, percent-clipped=0.0 2023-06-28 04:55:38,506 INFO [train.py:996] (2/4) Epoch 11, batch 24800, loss[loss=0.2084, simple_loss=0.2766, pruned_loss=0.07008, over 21836.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2766, pruned_loss=0.06541, over 4253908.42 frames. 
], batch size: 298, lr: 2.61e-03, grad_scale: 32.0 2023-06-28 04:55:45,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1978476.0, ans=0.125 2023-06-28 04:56:22,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1978596.0, ans=0.035 2023-06-28 04:56:54,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1978656.0, ans=0.125 2023-06-28 04:57:19,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1978716.0, ans=0.2 2023-06-28 04:57:22,249 INFO [train.py:996] (2/4) Epoch 11, batch 24850, loss[loss=0.2812, simple_loss=0.3491, pruned_loss=0.1066, over 21557.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.277, pruned_loss=0.06663, over 4261861.87 frames. ], batch size: 508, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:57:24,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1978776.0, ans=0.0 2023-06-28 04:58:01,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1978836.0, ans=0.125 2023-06-28 04:58:07,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1978836.0, ans=0.125 2023-06-28 04:58:29,646 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.09 vs. limit=15.0 2023-06-28 04:58:35,356 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.508e+02 8.527e+02 1.164e+03 1.873e+03 3.084e+03, threshold=2.328e+03, percent-clipped=28.0 2023-06-28 04:58:39,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1978956.0, ans=0.2 2023-06-28 04:58:59,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1979016.0, ans=0.125 2023-06-28 04:59:09,781 INFO [train.py:996] (2/4) Epoch 11, batch 24900, loss[loss=0.1789, simple_loss=0.2331, pruned_loss=0.06229, over 21343.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.279, pruned_loss=0.06742, over 4267526.42 frames. ], batch size: 176, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:59:12,965 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.26 vs. 
limit=15.0 2023-06-28 04:59:30,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1979136.0, ans=0.125 2023-06-28 04:59:45,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1979136.0, ans=0.5 2023-06-28 04:59:45,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1979136.0, ans=0.0 2023-06-28 05:00:01,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1979196.0, ans=0.1 2023-06-28 05:00:37,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1979316.0, ans=0.05 2023-06-28 05:00:57,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1979376.0, ans=0.0 2023-06-28 05:00:58,717 INFO [train.py:996] (2/4) Epoch 11, batch 24950, loss[loss=0.2653, simple_loss=0.3341, pruned_loss=0.0982, over 21821.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2878, pruned_loss=0.07127, over 4269963.56 frames. ], batch size: 441, lr: 2.61e-03, grad_scale: 8.0 2023-06-28 05:02:04,369 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.368e+02 8.687e+02 1.291e+03 2.049e+03 3.753e+03, threshold=2.582e+03, percent-clipped=19.0 2023-06-28 05:02:19,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1979616.0, ans=0.125 2023-06-28 05:02:42,789 INFO [train.py:996] (2/4) Epoch 11, batch 25000, loss[loss=0.1881, simple_loss=0.2604, pruned_loss=0.05797, over 21379.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2928, pruned_loss=0.07234, over 4269001.81 frames. ], batch size: 211, lr: 2.61e-03, grad_scale: 8.0 2023-06-28 05:02:50,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1979676.0, ans=0.125 2023-06-28 05:02:59,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1979676.0, ans=0.125 2023-06-28 05:03:11,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1979736.0, ans=0.0 2023-06-28 05:03:13,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1979736.0, ans=0.125 2023-06-28 05:03:26,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1979796.0, ans=0.0 2023-06-28 05:03:38,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1979796.0, ans=0.125 2023-06-28 05:04:04,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1979916.0, ans=10.0 2023-06-28 05:04:16,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1979916.0, ans=0.0 2023-06-28 05:04:25,870 INFO [train.py:996] (2/4) Epoch 11, batch 25050, loss[loss=0.1609, simple_loss=0.2326, pruned_loss=0.04459, over 21481.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2871, pruned_loss=0.07057, over 4269094.73 frames. 
], batch size: 212, lr: 2.61e-03, grad_scale: 8.0 2023-06-28 05:04:40,237 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.51 vs. limit=10.0 2023-06-28 05:04:41,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1979976.0, ans=0.125 2023-06-28 05:05:11,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1980096.0, ans=0.0 2023-06-28 05:05:37,080 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.870e+02 6.206e+02 8.703e+02 1.312e+03 2.418e+03, threshold=1.741e+03, percent-clipped=0.0 2023-06-28 05:05:45,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1980156.0, ans=0.015 2023-06-28 05:06:03,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1980216.0, ans=0.1 2023-06-28 05:06:09,897 INFO [train.py:996] (2/4) Epoch 11, batch 25100, loss[loss=0.2075, simple_loss=0.3033, pruned_loss=0.05581, over 21753.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2831, pruned_loss=0.06921, over 4255291.36 frames. ], batch size: 316, lr: 2.61e-03, grad_scale: 8.0 2023-06-28 05:06:10,944 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-28 05:06:20,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1980276.0, ans=0.125 2023-06-28 05:06:36,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1980336.0, ans=0.07 2023-06-28 05:06:43,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1980336.0, ans=0.1 2023-06-28 05:06:48,399 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=15.0 2023-06-28 05:06:56,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1980396.0, ans=0.125 2023-06-28 05:07:51,401 INFO [train.py:996] (2/4) Epoch 11, batch 25150, loss[loss=0.2077, simple_loss=0.2903, pruned_loss=0.06249, over 21222.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2842, pruned_loss=0.067, over 4262865.74 frames. ], batch size: 159, lr: 2.61e-03, grad_scale: 8.0 2023-06-28 05:08:55,624 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.446e+02 6.553e+02 1.065e+03 1.530e+03 2.529e+03, threshold=2.131e+03, percent-clipped=15.0 2023-06-28 05:09:17,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1980816.0, ans=0.0 2023-06-28 05:09:28,758 INFO [train.py:996] (2/4) Epoch 11, batch 25200, loss[loss=0.2071, simple_loss=0.3046, pruned_loss=0.0548, over 21757.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2835, pruned_loss=0.06561, over 4268577.81 frames. 
], batch size: 332, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:10:01,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1980936.0, ans=0.125 2023-06-28 05:10:13,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1980936.0, ans=0.1 2023-06-28 05:10:16,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1980996.0, ans=0.1 2023-06-28 05:10:26,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1980996.0, ans=0.1 2023-06-28 05:10:53,575 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.43 vs. limit=10.0 2023-06-28 05:11:10,874 INFO [train.py:996] (2/4) Epoch 11, batch 25250, loss[loss=0.1915, simple_loss=0.2619, pruned_loss=0.06059, over 21594.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2813, pruned_loss=0.06475, over 4271554.11 frames. ], batch size: 247, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:11:21,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1981176.0, ans=0.125 2023-06-28 05:11:44,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1981236.0, ans=0.0 2023-06-28 05:11:52,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1981236.0, ans=0.125 2023-06-28 05:11:54,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1981236.0, ans=0.125 2023-06-28 05:12:19,246 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-28 05:12:21,397 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.599e+02 7.557e+02 1.172e+03 1.779e+03 3.738e+03, threshold=2.344e+03, percent-clipped=14.0 2023-06-28 05:12:39,213 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=22.5 2023-06-28 05:12:59,836 INFO [train.py:996] (2/4) Epoch 11, batch 25300, loss[loss=0.2108, simple_loss=0.2892, pruned_loss=0.0662, over 21506.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2803, pruned_loss=0.06364, over 4262060.35 frames. ], batch size: 194, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:13:02,773 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-06-28 05:13:56,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1981596.0, ans=0.0 2023-06-28 05:14:40,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1981716.0, ans=0.125 2023-06-28 05:14:44,513 INFO [train.py:996] (2/4) Epoch 11, batch 25350, loss[loss=0.2136, simple_loss=0.2766, pruned_loss=0.07525, over 20005.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.282, pruned_loss=0.06334, over 4255956.20 frames. 
], batch size: 703, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:15:20,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1981836.0, ans=0.1 2023-06-28 05:15:27,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1981896.0, ans=0.2 2023-06-28 05:15:53,137 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.507e+02 7.550e+02 1.200e+03 1.857e+03 4.350e+03, threshold=2.399e+03, percent-clipped=14.0 2023-06-28 05:16:25,271 INFO [train.py:996] (2/4) Epoch 11, batch 25400, loss[loss=0.2446, simple_loss=0.3049, pruned_loss=0.09211, over 21362.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2795, pruned_loss=0.06284, over 4260825.34 frames. ], batch size: 471, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:16:58,806 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.05 vs. limit=15.0 2023-06-28 05:17:06,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1982196.0, ans=0.125 2023-06-28 05:17:34,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1982256.0, ans=0.2 2023-06-28 05:18:07,455 INFO [train.py:996] (2/4) Epoch 11, batch 25450, loss[loss=0.2253, simple_loss=0.313, pruned_loss=0.06881, over 21677.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2801, pruned_loss=0.06378, over 4266381.60 frames. ], batch size: 441, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:18:16,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1982376.0, ans=0.2 2023-06-28 05:19:17,818 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.365e+02 6.795e+02 1.021e+03 1.795e+03 3.141e+03, threshold=2.041e+03, percent-clipped=7.0 2023-06-28 05:19:21,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1982556.0, ans=0.1 2023-06-28 05:19:56,330 INFO [train.py:996] (2/4) Epoch 11, batch 25500, loss[loss=0.2519, simple_loss=0.3322, pruned_loss=0.08576, over 21204.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2808, pruned_loss=0.06136, over 4256663.53 frames. ], batch size: 143, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:20:02,869 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=22.5 2023-06-28 05:20:21,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1982736.0, ans=0.1 2023-06-28 05:21:10,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1982856.0, ans=0.0 2023-06-28 05:21:39,923 INFO [train.py:996] (2/4) Epoch 11, batch 25550, loss[loss=0.195, simple_loss=0.2619, pruned_loss=0.06406, over 16389.00 frames. ], tot_loss[loss=0.205, simple_loss=0.287, pruned_loss=0.06147, over 4251565.57 frames. ], batch size: 62, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:21:40,990 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.55 vs. 
limit=12.0 2023-06-28 05:21:56,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1982976.0, ans=0.125 2023-06-28 05:22:32,052 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.79 vs. limit=22.5 2023-06-28 05:22:44,326 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.429e+02 7.335e+02 1.015e+03 1.599e+03 3.312e+03, threshold=2.031e+03, percent-clipped=14.0 2023-06-28 05:23:28,331 INFO [train.py:996] (2/4) Epoch 11, batch 25600, loss[loss=0.2108, simple_loss=0.2924, pruned_loss=0.06457, over 19847.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2905, pruned_loss=0.06155, over 4244103.80 frames. ], batch size: 702, lr: 2.61e-03, grad_scale: 32.0 2023-06-28 05:23:43,142 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=15.0 2023-06-28 05:25:10,636 INFO [train.py:996] (2/4) Epoch 11, batch 25650, loss[loss=0.26, simple_loss=0.397, pruned_loss=0.06148, over 19678.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2914, pruned_loss=0.06333, over 4246390.08 frames. ], batch size: 702, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:25:12,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1983576.0, ans=0.125 2023-06-28 05:25:12,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1983576.0, ans=0.125 2023-06-28 05:26:01,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1983696.0, ans=0.015 2023-06-28 05:26:21,288 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.706e+02 6.784e+02 1.002e+03 1.536e+03 3.689e+03, threshold=2.004e+03, percent-clipped=11.0 2023-06-28 05:26:48,910 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.36 vs. limit=15.0 2023-06-28 05:26:52,928 INFO [train.py:996] (2/4) Epoch 11, batch 25700, loss[loss=0.1916, simple_loss=0.264, pruned_loss=0.05959, over 21887.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.288, pruned_loss=0.06473, over 4248953.77 frames. ], batch size: 316, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:27:54,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1984056.0, ans=0.125 2023-06-28 05:28:03,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1984056.0, ans=0.125 2023-06-28 05:28:32,206 INFO [train.py:996] (2/4) Epoch 11, batch 25750, loss[loss=0.203, simple_loss=0.278, pruned_loss=0.06403, over 20736.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2927, pruned_loss=0.06737, over 4257155.99 frames. 
], batch size: 608, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:29:12,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1984296.0, ans=0.125 2023-06-28 05:29:50,525 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.299e+02 8.292e+02 1.215e+03 2.235e+03 4.745e+03, threshold=2.430e+03, percent-clipped=27.0 2023-06-28 05:30:23,503 INFO [train.py:996] (2/4) Epoch 11, batch 25800, loss[loss=0.2388, simple_loss=0.3437, pruned_loss=0.06694, over 20734.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3061, pruned_loss=0.07172, over 4256820.85 frames. ], batch size: 607, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:31:19,546 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.38 vs. limit=8.0 2023-06-28 05:31:51,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1984716.0, ans=0.125 2023-06-28 05:32:02,224 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-28 05:32:05,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1984776.0, ans=0.09899494936611666 2023-06-28 05:32:06,397 INFO [train.py:996] (2/4) Epoch 11, batch 25850, loss[loss=0.205, simple_loss=0.2862, pruned_loss=0.06193, over 21781.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3075, pruned_loss=0.07165, over 4255475.51 frames. ], batch size: 247, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:32:30,808 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.66 vs. limit=15.0 2023-06-28 05:32:46,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1984836.0, ans=0.125 2023-06-28 05:33:18,847 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.240e+02 7.750e+02 1.095e+03 1.413e+03 4.702e+03, threshold=2.190e+03, percent-clipped=3.0 2023-06-28 05:33:19,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1984956.0, ans=0.04949747468305833 2023-06-28 05:33:24,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1984956.0, ans=0.0 2023-06-28 05:33:45,968 INFO [train.py:996] (2/4) Epoch 11, batch 25900, loss[loss=0.2439, simple_loss=0.3401, pruned_loss=0.07389, over 21844.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3089, pruned_loss=0.07217, over 4264030.68 frames. ], batch size: 316, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:34:41,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1985196.0, ans=0.0 2023-06-28 05:35:00,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1985256.0, ans=0.05 2023-06-28 05:35:29,677 INFO [train.py:996] (2/4) Epoch 11, batch 25950, loss[loss=0.2427, simple_loss=0.3264, pruned_loss=0.07948, over 21791.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3139, pruned_loss=0.07409, over 4268890.21 frames. 
], batch size: 124, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:36:25,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1985496.0, ans=0.1 2023-06-28 05:36:27,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1985496.0, ans=0.125 2023-06-28 05:36:38,159 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.88 vs. limit=15.0 2023-06-28 05:36:39,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1985556.0, ans=0.125 2023-06-28 05:36:41,751 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.536e+02 7.393e+02 8.893e+02 1.407e+03 4.224e+03, threshold=1.779e+03, percent-clipped=8.0 2023-06-28 05:37:01,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1985616.0, ans=0.0 2023-06-28 05:37:18,834 INFO [train.py:996] (2/4) Epoch 11, batch 26000, loss[loss=0.2088, simple_loss=0.2957, pruned_loss=0.06093, over 21319.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.313, pruned_loss=0.07293, over 4268863.89 frames. ], batch size: 176, lr: 2.61e-03, grad_scale: 32.0 2023-06-28 05:37:44,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1985736.0, ans=0.125 2023-06-28 05:37:50,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1985736.0, ans=0.035 2023-06-28 05:38:08,187 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.26 vs. limit=22.5 2023-06-28 05:38:11,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1985796.0, ans=0.125 2023-06-28 05:38:15,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1985796.0, ans=0.0 2023-06-28 05:38:42,890 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-06-28 05:39:00,975 INFO [train.py:996] (2/4) Epoch 11, batch 26050, loss[loss=0.2214, simple_loss=0.2871, pruned_loss=0.07789, over 21276.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3117, pruned_loss=0.07292, over 4272114.55 frames. ], batch size: 143, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:39:05,299 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.70 vs. limit=10.0 2023-06-28 05:39:37,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1986096.0, ans=0.125 2023-06-28 05:40:03,338 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.740e+02 6.927e+02 9.303e+02 1.315e+03 2.564e+03, threshold=1.861e+03, percent-clipped=11.0 2023-06-28 05:40:06,338 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.65 vs. 
limit=12.0 2023-06-28 05:40:28,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1986216.0, ans=0.0 2023-06-28 05:40:37,549 INFO [train.py:996] (2/4) Epoch 11, batch 26100, loss[loss=0.1877, simple_loss=0.2574, pruned_loss=0.05899, over 21837.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3067, pruned_loss=0.07263, over 4270696.42 frames. ], batch size: 247, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:41:20,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1986396.0, ans=0.125 2023-06-28 05:41:41,377 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0 2023-06-28 05:42:00,389 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=15.0 2023-06-28 05:42:21,829 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=15.0 2023-06-28 05:42:25,420 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=22.5 2023-06-28 05:42:25,872 INFO [train.py:996] (2/4) Epoch 11, batch 26150, loss[loss=0.2375, simple_loss=0.31, pruned_loss=0.08245, over 21794.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3039, pruned_loss=0.07335, over 4278201.62 frames. ], batch size: 441, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:42:47,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1986636.0, ans=15.0 2023-06-28 05:42:59,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1986636.0, ans=0.0 2023-06-28 05:43:07,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1986696.0, ans=0.125 2023-06-28 05:43:23,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1986756.0, ans=0.0 2023-06-28 05:43:40,783 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.470e+02 6.839e+02 9.051e+02 1.314e+03 2.834e+03, threshold=1.810e+03, percent-clipped=6.0 2023-06-28 05:44:10,859 INFO [train.py:996] (2/4) Epoch 11, batch 26200, loss[loss=0.2363, simple_loss=0.3405, pruned_loss=0.06608, over 21599.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3051, pruned_loss=0.07169, over 4282680.65 frames. ], batch size: 389, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:44:38,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1986936.0, ans=0.1 2023-06-28 05:45:15,802 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.60 vs. limit=12.0 2023-06-28 05:45:33,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1987116.0, ans=0.125 2023-06-28 05:45:49,439 INFO [train.py:996] (2/4) Epoch 11, batch 26250, loss[loss=0.2156, simple_loss=0.2922, pruned_loss=0.06952, over 21251.00 frames. 
], tot_loss[loss=0.2246, simple_loss=0.3082, pruned_loss=0.0705, over 4289777.22 frames. ], batch size: 176, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:46:21,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1987236.0, ans=0.125 2023-06-28 05:47:01,914 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.966e+02 7.302e+02 1.108e+03 1.607e+03 4.168e+03, threshold=2.217e+03, percent-clipped=19.0 2023-06-28 05:47:23,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1987416.0, ans=0.125 2023-06-28 05:47:31,745 INFO [train.py:996] (2/4) Epoch 11, batch 26300, loss[loss=0.2243, simple_loss=0.303, pruned_loss=0.07279, over 21471.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3045, pruned_loss=0.07142, over 4296542.90 frames. ], batch size: 131, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:47:47,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1987476.0, ans=0.125 2023-06-28 05:47:48,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1987476.0, ans=0.125 2023-06-28 05:48:30,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1987596.0, ans=0.0 2023-06-28 05:48:49,419 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.74 vs. limit=6.0 2023-06-28 05:48:50,465 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 05:49:12,618 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.87 vs. limit=15.0 2023-06-28 05:49:19,392 INFO [train.py:996] (2/4) Epoch 11, batch 26350, loss[loss=0.236, simple_loss=0.3125, pruned_loss=0.07981, over 21741.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3034, pruned_loss=0.07198, over 4296052.69 frames. ], batch size: 332, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:49:31,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1987776.0, ans=0.2 2023-06-28 05:50:32,392 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.942e+02 8.987e+02 1.115e+03 1.521e+03 3.466e+03, threshold=2.231e+03, percent-clipped=6.0 2023-06-28 05:50:38,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1987956.0, ans=0.2 2023-06-28 05:50:54,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1988016.0, ans=0.05 2023-06-28 05:51:02,139 INFO [train.py:996] (2/4) Epoch 11, batch 26400, loss[loss=0.1953, simple_loss=0.2606, pruned_loss=0.06498, over 21810.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2977, pruned_loss=0.0719, over 4294943.66 frames. 
], batch size: 352, lr: 2.61e-03, grad_scale: 32.0 2023-06-28 05:51:16,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1988076.0, ans=0.0 2023-06-28 05:51:31,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1988136.0, ans=0.0 2023-06-28 05:51:55,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1988196.0, ans=0.125 2023-06-28 05:51:59,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1988196.0, ans=0.0 2023-06-28 05:52:06,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1988196.0, ans=0.025 2023-06-28 05:52:10,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1988256.0, ans=0.125 2023-06-28 05:52:31,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1988316.0, ans=0.125 2023-06-28 05:52:42,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1988316.0, ans=0.2 2023-06-28 05:52:48,901 INFO [train.py:996] (2/4) Epoch 11, batch 26450, loss[loss=0.2424, simple_loss=0.3456, pruned_loss=0.06956, over 21642.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2979, pruned_loss=0.07173, over 4284968.35 frames. ], batch size: 389, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:52:56,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1988376.0, ans=0.2 2023-06-28 05:53:31,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1988436.0, ans=0.2 2023-06-28 05:53:57,071 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.50 vs. limit=15.0 2023-06-28 05:54:09,263 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.671e+02 1.024e+03 1.650e+03 2.442e+03 4.564e+03, threshold=3.300e+03, percent-clipped=28.0 2023-06-28 05:54:35,915 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=15.0 2023-06-28 05:54:37,913 INFO [train.py:996] (2/4) Epoch 11, batch 26500, loss[loss=0.2533, simple_loss=0.3428, pruned_loss=0.08196, over 21714.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2981, pruned_loss=0.07039, over 4272913.59 frames. ], batch size: 414, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:55:13,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1988736.0, ans=0.1 2023-06-28 05:56:00,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1988856.0, ans=0.1 2023-06-28 05:56:23,194 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-06-28 05:56:28,885 INFO [train.py:996] (2/4) Epoch 11, batch 26550, loss[loss=0.2495, simple_loss=0.3367, pruned_loss=0.0811, over 21520.00 frames. 
], tot_loss[loss=0.2164, simple_loss=0.2964, pruned_loss=0.06821, over 4269350.42 frames. ], batch size: 508, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:57:06,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1989096.0, ans=0.0 2023-06-28 05:57:36,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1989156.0, ans=0.1 2023-06-28 05:57:38,607 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.448e+02 7.811e+02 1.294e+03 2.097e+03 4.356e+03, threshold=2.588e+03, percent-clipped=4.0 2023-06-28 05:57:55,644 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.74 vs. limit=12.0 2023-06-28 05:58:07,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1989216.0, ans=0.125 2023-06-28 05:58:10,601 INFO [train.py:996] (2/4) Epoch 11, batch 26600, loss[loss=0.1827, simple_loss=0.2508, pruned_loss=0.05729, over 20747.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2964, pruned_loss=0.06543, over 4263775.29 frames. ], batch size: 608, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:59:08,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1989396.0, ans=0.125 2023-06-28 05:59:52,623 INFO [train.py:996] (2/4) Epoch 11, batch 26650, loss[loss=0.1547, simple_loss=0.2323, pruned_loss=0.0386, over 21542.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2886, pruned_loss=0.06388, over 4262229.32 frames. ], batch size: 195, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:59:59,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1989576.0, ans=0.125 2023-06-28 06:00:47,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1989696.0, ans=0.0 2023-06-28 06:00:53,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1989756.0, ans=0.125 2023-06-28 06:01:03,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1989756.0, ans=0.125 2023-06-28 06:01:05,985 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.789e+02 5.280e+02 6.792e+02 8.539e+02 2.170e+03, threshold=1.358e+03, percent-clipped=0.0 2023-06-28 06:01:33,808 INFO [train.py:996] (2/4) Epoch 11, batch 26700, loss[loss=0.1808, simple_loss=0.249, pruned_loss=0.0563, over 21199.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2816, pruned_loss=0.06123, over 4261852.69 frames. 
], batch size: 176, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 06:01:57,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1989936.0, ans=0.125 2023-06-28 06:02:01,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1989936.0, ans=0.125 2023-06-28 06:02:11,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1989996.0, ans=0.1 2023-06-28 06:02:16,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1989996.0, ans=0.125 2023-06-28 06:02:31,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1989996.0, ans=0.0 2023-06-28 06:02:44,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1990056.0, ans=0.125 2023-06-28 06:03:18,032 INFO [train.py:996] (2/4) Epoch 11, batch 26750, loss[loss=0.2396, simple_loss=0.3203, pruned_loss=0.07943, over 21724.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.283, pruned_loss=0.06053, over 4275711.55 frames. ], batch size: 441, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 06:03:18,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1990176.0, ans=0.07 2023-06-28 06:04:33,826 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.602e+02 7.194e+02 1.094e+03 1.684e+03 4.507e+03, threshold=2.188e+03, percent-clipped=37.0 2023-06-28 06:04:35,463 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.45 vs. limit=15.0 2023-06-28 06:05:02,089 INFO [train.py:996] (2/4) Epoch 11, batch 26800, loss[loss=0.2039, simple_loss=0.2765, pruned_loss=0.06571, over 20023.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2909, pruned_loss=0.06467, over 4275800.52 frames. ], batch size: 703, lr: 2.61e-03, grad_scale: 32.0 2023-06-28 06:05:09,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1990476.0, ans=0.0 2023-06-28 06:05:18,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1990536.0, ans=0.125 2023-06-28 06:05:35,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1990536.0, ans=10.0 2023-06-28 06:06:10,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1990656.0, ans=0.2 2023-06-28 06:06:10,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1990656.0, ans=0.125 2023-06-28 06:06:12,406 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=22.5 2023-06-28 06:06:19,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1990656.0, ans=0.0 2023-06-28 06:06:43,231 INFO [train.py:996] (2/4) Epoch 11, batch 26850, loss[loss=0.1927, simple_loss=0.2509, pruned_loss=0.0673, over 21693.00 frames. 
], tot_loss[loss=0.2137, simple_loss=0.2922, pruned_loss=0.06759, over 4282111.52 frames. ], batch size: 282, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 06:07:37,712 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=15.0 2023-06-28 06:07:40,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1990896.0, ans=0.125 2023-06-28 06:07:47,738 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.91 vs. limit=15.0 2023-06-28 06:08:02,685 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.773e+02 7.985e+02 1.116e+03 1.630e+03 3.577e+03, threshold=2.232e+03, percent-clipped=14.0 2023-06-28 06:08:22,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1991016.0, ans=0.125 2023-06-28 06:08:24,666 INFO [train.py:996] (2/4) Epoch 11, batch 26900, loss[loss=0.2065, simple_loss=0.2753, pruned_loss=0.06885, over 21913.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2845, pruned_loss=0.06714, over 4279507.22 frames. ], batch size: 125, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 06:08:57,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1991136.0, ans=0.2 2023-06-28 06:10:02,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1991316.0, ans=0.0 2023-06-28 06:10:05,642 INFO [train.py:996] (2/4) Epoch 11, batch 26950, loss[loss=0.2086, simple_loss=0.2968, pruned_loss=0.06018, over 21742.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2839, pruned_loss=0.06695, over 4277346.48 frames. ], batch size: 351, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 06:10:21,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1991436.0, ans=0.0 2023-06-28 06:10:22,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1991436.0, ans=0.2 2023-06-28 06:10:39,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1991436.0, ans=0.95 2023-06-28 06:11:14,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1991556.0, ans=0.1 2023-06-28 06:11:24,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1991556.0, ans=0.0 2023-06-28 06:11:27,132 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.642e+02 6.578e+02 9.395e+02 1.272e+03 2.979e+03, threshold=1.879e+03, percent-clipped=1.0 2023-06-28 06:11:32,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1991616.0, ans=0.0 2023-06-28 06:11:47,977 INFO [train.py:996] (2/4) Epoch 11, batch 27000, loss[loss=0.2567, simple_loss=0.3375, pruned_loss=0.08788, over 21453.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2847, pruned_loss=0.0649, over 4277322.47 frames. 
], batch size: 508, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:11:47,977 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-28 06:11:59,995 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([5.6637, 5.1168, 5.3969, 4.8329], device='cuda:2') 2023-06-28 06:12:09,383 INFO [train.py:1028] (2/4) Epoch 11, validation: loss=0.246, simple_loss=0.3377, pruned_loss=0.07718, over 1796401.00 frames. 2023-06-28 06:12:09,384 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23793MB 2023-06-28 06:12:32,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1991736.0, ans=0.125 2023-06-28 06:12:33,151 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.80 vs. limit=10.0 2023-06-28 06:12:34,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1991736.0, ans=0.125 2023-06-28 06:12:42,927 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=15.0 2023-06-28 06:13:23,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1991856.0, ans=0.1 2023-06-28 06:13:32,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=1991916.0, ans=0.02 2023-06-28 06:13:48,936 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.03 vs. limit=12.0 2023-06-28 06:13:57,806 INFO [train.py:996] (2/4) Epoch 11, batch 27050, loss[loss=0.2749, simple_loss=0.3364, pruned_loss=0.1068, over 21634.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2884, pruned_loss=0.06283, over 4278931.06 frames. ], batch size: 507, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:14:02,089 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-06-28 06:14:03,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1991976.0, ans=0.125 2023-06-28 06:14:21,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1992036.0, ans=0.0 2023-06-28 06:15:06,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1992156.0, ans=0.2 2023-06-28 06:15:07,776 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.395e+02 5.810e+02 8.142e+02 1.096e+03 2.681e+03, threshold=1.628e+03, percent-clipped=6.0 2023-06-28 06:15:36,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1992276.0, ans=0.2 2023-06-28 06:15:37,372 INFO [train.py:996] (2/4) Epoch 11, batch 27100, loss[loss=0.2078, simple_loss=0.3098, pruned_loss=0.05289, over 21638.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2886, pruned_loss=0.06264, over 4281189.73 frames. 
], batch size: 230, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:15:50,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1992276.0, ans=0.0 2023-06-28 06:16:13,899 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.02 vs. limit=15.0 2023-06-28 06:16:14,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1992336.0, ans=0.125 2023-06-28 06:16:22,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1992396.0, ans=0.125 2023-06-28 06:16:24,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1992396.0, ans=0.0 2023-06-28 06:16:45,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1992456.0, ans=0.0 2023-06-28 06:16:49,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1992456.0, ans=0.035 2023-06-28 06:16:53,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1992456.0, ans=0.07 2023-06-28 06:16:54,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1992516.0, ans=0.125 2023-06-28 06:17:22,567 INFO [train.py:996] (2/4) Epoch 11, batch 27150, loss[loss=0.2913, simple_loss=0.3885, pruned_loss=0.09706, over 21262.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.3011, pruned_loss=0.06609, over 4283489.65 frames. ], batch size: 548, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:17:27,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1992576.0, ans=0.125 2023-06-28 06:18:14,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1992696.0, ans=0.125 2023-06-28 06:18:27,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1992756.0, ans=0.2 2023-06-28 06:18:35,177 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.727e+02 8.827e+02 1.451e+03 2.136e+03 4.044e+03, threshold=2.902e+03, percent-clipped=43.0 2023-06-28 06:18:41,385 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.42 vs. limit=15.0 2023-06-28 06:18:59,851 INFO [train.py:996] (2/4) Epoch 11, batch 27200, loss[loss=0.2679, simple_loss=0.3429, pruned_loss=0.09644, over 21590.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3083, pruned_loss=0.06861, over 4283715.81 frames. ], batch size: 414, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:19:00,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1992876.0, ans=0.2 2023-06-28 06:19:17,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1992876.0, ans=0.0 2023-06-28 06:19:19,612 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.44 vs. 
limit=15.0 2023-06-28 06:19:42,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1992996.0, ans=0.1 2023-06-28 06:19:44,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1992996.0, ans=0.125 2023-06-28 06:20:49,572 INFO [train.py:996] (2/4) Epoch 11, batch 27250, loss[loss=0.2225, simple_loss=0.2871, pruned_loss=0.07894, over 20038.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.31, pruned_loss=0.07162, over 4284158.32 frames. ], batch size: 703, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:20:58,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1993176.0, ans=0.2 2023-06-28 06:21:50,456 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. limit=6.0 2023-06-28 06:21:53,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1993296.0, ans=0.1 2023-06-28 06:22:14,914 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.213e+02 7.322e+02 9.525e+02 1.331e+03 3.028e+03, threshold=1.905e+03, percent-clipped=1.0 2023-06-28 06:22:20,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1993416.0, ans=0.0 2023-06-28 06:22:34,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1993476.0, ans=0.1 2023-06-28 06:22:35,623 INFO [train.py:996] (2/4) Epoch 11, batch 27300, loss[loss=0.2068, simple_loss=0.2995, pruned_loss=0.05706, over 21955.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3108, pruned_loss=0.07251, over 4278312.19 frames. ], batch size: 317, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:23:37,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1993596.0, ans=0.0 2023-06-28 06:23:57,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1993656.0, ans=0.125 2023-06-28 06:24:20,012 INFO [train.py:996] (2/4) Epoch 11, batch 27350, loss[loss=0.2195, simple_loss=0.3004, pruned_loss=0.06928, over 21678.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3133, pruned_loss=0.07375, over 4271579.28 frames. 
], batch size: 389, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:24:20,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1993776.0, ans=0.125 2023-06-28 06:24:23,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1993776.0, ans=0.125 2023-06-28 06:24:57,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1993836.0, ans=0.2 2023-06-28 06:25:40,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1993956.0, ans=0.125 2023-06-28 06:25:41,648 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.836e+02 8.634e+02 1.277e+03 1.695e+03 3.535e+03, threshold=2.554e+03, percent-clipped=18.0 2023-06-28 06:26:01,626 INFO [train.py:996] (2/4) Epoch 11, batch 27400, loss[loss=0.2063, simple_loss=0.2715, pruned_loss=0.07058, over 21329.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.309, pruned_loss=0.07328, over 4270826.86 frames. ], batch size: 176, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:26:02,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1994076.0, ans=0.2 2023-06-28 06:26:49,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1994196.0, ans=0.1 2023-06-28 06:26:56,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1994196.0, ans=0.0 2023-06-28 06:27:44,106 INFO [train.py:996] (2/4) Epoch 11, batch 27450, loss[loss=0.207, simple_loss=0.2962, pruned_loss=0.05892, over 21736.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.3016, pruned_loss=0.07141, over 4263122.78 frames. ], batch size: 282, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:27:58,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1994376.0, ans=0.0 2023-06-28 06:28:14,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1994436.0, ans=0.1 2023-06-28 06:28:18,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1994436.0, ans=0.125 2023-06-28 06:28:28,355 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-28 06:29:05,151 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.109e+02 6.770e+02 1.005e+03 1.545e+03 3.220e+03, threshold=2.009e+03, percent-clipped=5.0 2023-06-28 06:29:25,931 INFO [train.py:996] (2/4) Epoch 11, batch 27500, loss[loss=0.2276, simple_loss=0.2875, pruned_loss=0.08389, over 21578.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.3, pruned_loss=0.07136, over 4269602.45 frames. ], batch size: 548, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:30:55,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1994916.0, ans=0.0 2023-06-28 06:31:16,361 INFO [train.py:996] (2/4) Epoch 11, batch 27550, loss[loss=0.2834, simple_loss=0.398, pruned_loss=0.08445, over 19874.00 frames. ], tot_loss[loss=0.216, simple_loss=0.295, pruned_loss=0.06848, over 4268104.42 frames. 
], batch size: 702, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:31:28,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1994976.0, ans=0.0 2023-06-28 06:31:40,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1995036.0, ans=0.5 2023-06-28 06:31:52,030 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.36 vs. limit=15.0 2023-06-28 06:32:02,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1995096.0, ans=0.125 2023-06-28 06:32:15,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1995156.0, ans=0.125 2023-06-28 06:32:24,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1995156.0, ans=0.0 2023-06-28 06:32:28,475 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.464e+02 6.626e+02 9.640e+02 1.426e+03 2.852e+03, threshold=1.928e+03, percent-clipped=10.0 2023-06-28 06:32:37,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1995216.0, ans=0.2 2023-06-28 06:32:53,133 INFO [train.py:996] (2/4) Epoch 11, batch 27600, loss[loss=0.2032, simple_loss=0.2733, pruned_loss=0.06655, over 21803.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2871, pruned_loss=0.06694, over 4274610.73 frames. ], batch size: 102, lr: 2.60e-03, grad_scale: 32.0 2023-06-28 06:33:28,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1995336.0, ans=0.125 2023-06-28 06:34:26,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1995516.0, ans=0.125 2023-06-28 06:34:30,774 INFO [train.py:996] (2/4) Epoch 11, batch 27650, loss[loss=0.1855, simple_loss=0.2528, pruned_loss=0.05912, over 21220.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2824, pruned_loss=0.06686, over 4258948.23 frames. ], batch size: 176, lr: 2.60e-03, grad_scale: 32.0 2023-06-28 06:35:06,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1995636.0, ans=0.125 2023-06-28 06:35:43,517 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.03 vs. limit=15.0 2023-06-28 06:35:54,453 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.549e+02 8.580e+02 1.335e+03 1.838e+03 2.881e+03, threshold=2.670e+03, percent-clipped=20.0 2023-06-28 06:36:16,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1995876.0, ans=0.125 2023-06-28 06:36:17,808 INFO [train.py:996] (2/4) Epoch 11, batch 27700, loss[loss=0.2238, simple_loss=0.315, pruned_loss=0.06628, over 21788.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2834, pruned_loss=0.06505, over 4266592.14 frames. 
], batch size: 316, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:37:47,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1996116.0, ans=0.125 2023-06-28 06:38:04,200 INFO [train.py:996] (2/4) Epoch 11, batch 27750, loss[loss=0.1989, simple_loss=0.2861, pruned_loss=0.0559, over 21704.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2883, pruned_loss=0.06543, over 4269340.45 frames. ], batch size: 389, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:38:28,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1996236.0, ans=0.0 2023-06-28 06:38:46,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1996296.0, ans=0.125 2023-06-28 06:39:07,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1996356.0, ans=0.125 2023-06-28 06:39:18,759 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.799e+02 7.463e+02 1.002e+03 1.388e+03 2.774e+03, threshold=2.003e+03, percent-clipped=1.0 2023-06-28 06:39:39,536 INFO [train.py:996] (2/4) Epoch 11, batch 27800, loss[loss=0.2328, simple_loss=0.2992, pruned_loss=0.08318, over 21648.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2874, pruned_loss=0.06626, over 4276328.80 frames. ], batch size: 471, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:40:51,197 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.79 vs. limit=15.0 2023-06-28 06:40:51,223 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.70 vs. limit=15.0 2023-06-28 06:40:55,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1996656.0, ans=0.125 2023-06-28 06:41:26,070 INFO [train.py:996] (2/4) Epoch 11, batch 27850, loss[loss=0.2166, simple_loss=0.2944, pruned_loss=0.06942, over 21829.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2868, pruned_loss=0.06769, over 4280476.45 frames. ], batch size: 118, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:42:49,899 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.063e+02 7.124e+02 9.695e+02 1.441e+03 2.660e+03, threshold=1.939e+03, percent-clipped=8.0 2023-06-28 06:42:50,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1997016.0, ans=0.125 2023-06-28 06:43:16,102 INFO [train.py:996] (2/4) Epoch 11, batch 27900, loss[loss=0.2339, simple_loss=0.3296, pruned_loss=0.06912, over 21841.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2967, pruned_loss=0.06957, over 4272210.56 frames. ], batch size: 316, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:43:53,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1997196.0, ans=0.0 2023-06-28 06:44:35,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1997256.0, ans=0.2 2023-06-28 06:44:48,101 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.91 vs. 
limit=15.0 2023-06-28 06:44:54,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1997316.0, ans=0.125 2023-06-28 06:44:57,007 INFO [train.py:996] (2/4) Epoch 11, batch 27950, loss[loss=0.1625, simple_loss=0.2522, pruned_loss=0.03635, over 21419.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2951, pruned_loss=0.06568, over 4277119.79 frames. ], batch size: 194, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:45:02,998 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-06-28 06:45:04,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1997376.0, ans=0.125 2023-06-28 06:45:09,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1997376.0, ans=0.0 2023-06-28 06:45:26,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1997436.0, ans=0.2 2023-06-28 06:46:23,005 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.171e+02 6.116e+02 8.584e+02 1.262e+03 3.314e+03, threshold=1.717e+03, percent-clipped=6.0 2023-06-28 06:46:31,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1997616.0, ans=0.125 2023-06-28 06:46:39,425 INFO [train.py:996] (2/4) Epoch 11, batch 28000, loss[loss=0.2087, simple_loss=0.2872, pruned_loss=0.06514, over 21875.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2931, pruned_loss=0.06435, over 4280871.00 frames. ], batch size: 351, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:47:06,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1997736.0, ans=0.1 2023-06-28 06:47:20,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1997796.0, ans=0.125 2023-06-28 06:47:53,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1997856.0, ans=0.125 2023-06-28 06:48:12,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1997916.0, ans=0.125 2023-06-28 06:48:20,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1997916.0, ans=0.1 2023-06-28 06:48:23,192 INFO [train.py:996] (2/4) Epoch 11, batch 28050, loss[loss=0.2156, simple_loss=0.2976, pruned_loss=0.0668, over 21839.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2904, pruned_loss=0.06533, over 4283280.72 frames. ], batch size: 371, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:49:00,130 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.63 vs. 
limit=15.0 2023-06-28 06:49:23,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1998096.0, ans=0.0 2023-06-28 06:49:50,470 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.773e+02 7.059e+02 1.070e+03 1.534e+03 3.837e+03, threshold=2.141e+03, percent-clipped=19.0 2023-06-28 06:50:01,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1998216.0, ans=0.125 2023-06-28 06:50:05,476 INFO [train.py:996] (2/4) Epoch 11, batch 28100, loss[loss=0.2205, simple_loss=0.2784, pruned_loss=0.08129, over 21502.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2912, pruned_loss=0.06566, over 4284652.16 frames. ], batch size: 441, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:50:19,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1998276.0, ans=0.125 2023-06-28 06:50:34,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=1998336.0, ans=0.02 2023-06-28 06:51:42,279 INFO [train.py:996] (2/4) Epoch 11, batch 28150, loss[loss=0.1898, simple_loss=0.2506, pruned_loss=0.06455, over 21296.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2847, pruned_loss=0.06532, over 4286756.52 frames. ], batch size: 144, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:52:44,365 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.19 vs. limit=15.0 2023-06-28 06:52:48,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1998756.0, ans=0.125 2023-06-28 06:53:04,713 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.152e+02 7.382e+02 1.011e+03 1.548e+03 3.347e+03, threshold=2.022e+03, percent-clipped=11.0 2023-06-28 06:53:19,838 INFO [train.py:996] (2/4) Epoch 11, batch 28200, loss[loss=0.2213, simple_loss=0.2954, pruned_loss=0.07362, over 21694.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2823, pruned_loss=0.06589, over 4274265.89 frames. ], batch size: 332, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:53:24,832 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.65 vs. limit=15.0 2023-06-28 06:54:16,522 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.07 vs. limit=10.0 2023-06-28 06:54:17,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1998996.0, ans=0.125 2023-06-28 06:54:19,323 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 06:54:33,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1999056.0, ans=0.125 2023-06-28 06:54:58,090 INFO [train.py:996] (2/4) Epoch 11, batch 28250, loss[loss=0.2088, simple_loss=0.2781, pruned_loss=0.06969, over 21750.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2847, pruned_loss=0.06815, over 4276429.07 frames. 
], batch size: 351, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:55:38,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1999236.0, ans=0.07 2023-06-28 06:56:17,706 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.47 vs. limit=15.0 2023-06-28 06:56:18,675 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 06:56:21,467 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.911e+02 7.590e+02 1.013e+03 1.851e+03 3.926e+03, threshold=2.026e+03, percent-clipped=15.0 2023-06-28 06:56:37,226 INFO [train.py:996] (2/4) Epoch 11, batch 28300, loss[loss=0.1655, simple_loss=0.2561, pruned_loss=0.0375, over 21638.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2833, pruned_loss=0.0661, over 4252061.39 frames. ], batch size: 263, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:58:15,200 INFO [train.py:996] (2/4) Epoch 11, batch 28350, loss[loss=0.1737, simple_loss=0.2879, pruned_loss=0.02977, over 20900.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2799, pruned_loss=0.06107, over 4251850.81 frames. ], batch size: 608, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:58:48,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1999836.0, ans=0.125 2023-06-28 06:58:54,126 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=22.5 2023-06-28 06:59:14,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1999896.0, ans=0.1 2023-06-28 06:59:37,857 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.816e+02 8.082e+02 1.144e+03 1.595e+03 4.896e+03, threshold=2.288e+03, percent-clipped=16.0 2023-06-28 06:59:57,604 INFO [train.py:996] (2/4) Epoch 11, batch 28400, loss[loss=0.2101, simple_loss=0.2819, pruned_loss=0.06914, over 21704.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2756, pruned_loss=0.06148, over 4259364.39 frames. ], batch size: 332, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:00:08,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=2000076.0, ans=15.0 2023-06-28 07:00:28,792 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=22.5 2023-06-28 07:01:01,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2000256.0, ans=0.2 2023-06-28 07:01:24,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2000316.0, ans=0.07 2023-06-28 07:01:29,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2000316.0, ans=0.025 2023-06-28 07:01:40,573 INFO [train.py:996] (2/4) Epoch 11, batch 28450, loss[loss=0.2408, simple_loss=0.3212, pruned_loss=0.08018, over 21880.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2816, pruned_loss=0.06493, over 4261244.93 frames. 
], batch size: 118, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:01:49,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2000376.0, ans=0.1 2023-06-28 07:01:54,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2000376.0, ans=0.2 2023-06-28 07:02:00,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2000376.0, ans=0.125 2023-06-28 07:02:31,755 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 07:02:49,889 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.06 vs. limit=22.5 2023-06-28 07:03:03,715 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.045e+02 8.706e+02 1.350e+03 2.003e+03 3.584e+03, threshold=2.700e+03, percent-clipped=15.0 2023-06-28 07:03:19,220 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 07:03:28,165 INFO [train.py:996] (2/4) Epoch 11, batch 28500, loss[loss=0.2467, simple_loss=0.3192, pruned_loss=0.08715, over 21184.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2839, pruned_loss=0.0669, over 4266198.47 frames. ], batch size: 143, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:03:39,714 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.77 vs. limit=15.0 2023-06-28 07:03:45,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2000676.0, ans=0.0 2023-06-28 07:04:30,303 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=12.0 2023-06-28 07:05:10,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2000976.0, ans=0.05 2023-06-28 07:05:11,631 INFO [train.py:996] (2/4) Epoch 11, batch 28550, loss[loss=0.2353, simple_loss=0.3344, pruned_loss=0.06809, over 21269.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2913, pruned_loss=0.06921, over 4272641.07 frames. ], batch size: 176, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:05:23,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2000976.0, ans=0.125 2023-06-28 07:05:32,561 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.74 vs. limit=15.0 2023-06-28 07:06:41,151 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.280e+02 7.153e+02 1.164e+03 1.649e+03 3.101e+03, threshold=2.329e+03, percent-clipped=2.0 2023-06-28 07:06:59,352 INFO [train.py:996] (2/4) Epoch 11, batch 28600, loss[loss=0.2377, simple_loss=0.3132, pruned_loss=0.08113, over 21564.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2983, pruned_loss=0.07136, over 4274551.98 frames. 
], batch size: 389, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 07:07:20,474 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 07:07:36,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2001396.0, ans=0.125 2023-06-28 07:08:08,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2001456.0, ans=0.07 2023-06-28 07:08:41,522 INFO [train.py:996] (2/4) Epoch 11, batch 28650, loss[loss=0.1715, simple_loss=0.237, pruned_loss=0.053, over 21516.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2929, pruned_loss=0.07062, over 4277734.73 frames. ], batch size: 213, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 07:09:17,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2001696.0, ans=0.125 2023-06-28 07:09:26,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2001696.0, ans=0.0 2023-06-28 07:09:47,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2001756.0, ans=0.125 2023-06-28 07:10:06,741 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.860e+02 6.883e+02 1.055e+03 1.706e+03 3.634e+03, threshold=2.110e+03, percent-clipped=9.0 2023-06-28 07:10:13,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2001816.0, ans=0.125 2023-06-28 07:10:13,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2001816.0, ans=0.125 2023-06-28 07:10:19,942 INFO [train.py:996] (2/4) Epoch 11, batch 28700, loss[loss=0.2663, simple_loss=0.3324, pruned_loss=0.1001, over 21462.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2923, pruned_loss=0.07175, over 4274785.42 frames. ], batch size: 471, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 07:10:32,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2001876.0, ans=0.07 2023-06-28 07:11:25,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2002056.0, ans=0.0 2023-06-28 07:12:02,751 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=12.0 2023-06-28 07:12:03,061 INFO [train.py:996] (2/4) Epoch 11, batch 28750, loss[loss=0.199, simple_loss=0.2958, pruned_loss=0.05112, over 21727.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2921, pruned_loss=0.07161, over 4281012.84 frames. ], batch size: 389, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 07:12:07,791 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.91 vs. 
limit=12.0 2023-06-28 07:12:38,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2002236.0, ans=0.125 2023-06-28 07:13:33,316 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.047e+02 7.903e+02 1.280e+03 1.957e+03 3.313e+03, threshold=2.559e+03, percent-clipped=20.0 2023-06-28 07:13:39,833 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.73 vs. limit=10.0 2023-06-28 07:13:46,643 INFO [train.py:996] (2/4) Epoch 11, batch 28800, loss[loss=0.2806, simple_loss=0.3525, pruned_loss=0.1044, over 21306.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2948, pruned_loss=0.07161, over 4276796.47 frames. ], batch size: 159, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:14:24,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2002536.0, ans=0.0 2023-06-28 07:14:34,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2002596.0, ans=0.125 2023-06-28 07:15:28,358 INFO [train.py:996] (2/4) Epoch 11, batch 28850, loss[loss=0.181, simple_loss=0.2362, pruned_loss=0.06295, over 20267.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2957, pruned_loss=0.07223, over 4278917.89 frames. ], batch size: 702, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:16:18,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2002896.0, ans=0.125 2023-06-28 07:16:47,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2002956.0, ans=0.0 2023-06-28 07:16:58,888 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.940e+02 7.188e+02 1.079e+03 1.563e+03 3.306e+03, threshold=2.159e+03, percent-clipped=4.0 2023-06-28 07:16:59,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2003016.0, ans=0.0 2023-06-28 07:17:04,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2003016.0, ans=0.125 2023-06-28 07:17:12,885 INFO [train.py:996] (2/4) Epoch 11, batch 28900, loss[loss=0.2863, simple_loss=0.3565, pruned_loss=0.1081, over 21575.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2975, pruned_loss=0.07322, over 4281098.39 frames. ], batch size: 508, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:17:30,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2003076.0, ans=0.125 2023-06-28 07:19:05,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2003376.0, ans=0.125 2023-06-28 07:19:06,030 INFO [train.py:996] (2/4) Epoch 11, batch 28950, loss[loss=0.2031, simple_loss=0.2923, pruned_loss=0.0569, over 21844.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3006, pruned_loss=0.0736, over 4283296.03 frames. ], batch size: 371, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:19:25,770 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.66 vs. 
limit=15.0 2023-06-28 07:19:55,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2003496.0, ans=0.125 2023-06-28 07:20:00,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2003496.0, ans=0.0 2023-06-28 07:20:05,200 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.63 vs. limit=22.5 2023-06-28 07:20:11,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2003556.0, ans=0.025 2023-06-28 07:20:16,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2003556.0, ans=0.125 2023-06-28 07:20:27,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2003616.0, ans=0.035 2023-06-28 07:20:36,317 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.876e+02 7.501e+02 1.038e+03 1.526e+03 3.753e+03, threshold=2.076e+03, percent-clipped=10.0 2023-06-28 07:20:54,746 INFO [train.py:996] (2/4) Epoch 11, batch 29000, loss[loss=0.2959, simple_loss=0.3506, pruned_loss=0.1206, over 21334.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.304, pruned_loss=0.07238, over 4280585.37 frames. ], batch size: 507, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:20:55,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2003676.0, ans=0.2 2023-06-28 07:21:02,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2003676.0, ans=0.125 2023-06-28 07:21:28,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2003736.0, ans=0.0 2023-06-28 07:21:30,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2003736.0, ans=0.125 2023-06-28 07:21:46,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2003796.0, ans=0.125 2023-06-28 07:22:35,936 INFO [train.py:996] (2/4) Epoch 11, batch 29050, loss[loss=0.2151, simple_loss=0.285, pruned_loss=0.07264, over 21871.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3023, pruned_loss=0.07333, over 4278062.96 frames. ], batch size: 414, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:23:24,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2004096.0, ans=0.2 2023-06-28 07:23:31,547 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-06-28 07:24:04,795 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.241e+02 7.722e+02 1.077e+03 1.560e+03 2.970e+03, threshold=2.155e+03, percent-clipped=7.0 2023-06-28 07:24:18,299 INFO [train.py:996] (2/4) Epoch 11, batch 29100, loss[loss=0.1798, simple_loss=0.2412, pruned_loss=0.05919, over 21563.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2935, pruned_loss=0.0711, over 4280578.06 frames. 
], batch size: 231, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:25:15,265 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=12.0 2023-06-28 07:25:22,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2004456.0, ans=0.1 2023-06-28 07:25:30,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2004456.0, ans=0.0 2023-06-28 07:25:56,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2004516.0, ans=0.125 2023-06-28 07:25:59,496 INFO [train.py:996] (2/4) Epoch 11, batch 29150, loss[loss=0.2092, simple_loss=0.2889, pruned_loss=0.06475, over 21522.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.292, pruned_loss=0.06969, over 4284201.42 frames. ], batch size: 230, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:26:39,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2004696.0, ans=0.0 2023-06-28 07:26:56,247 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-06-28 07:27:03,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2004756.0, ans=0.125 2023-06-28 07:27:11,791 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. limit=6.0 2023-06-28 07:27:17,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2004756.0, ans=0.125 2023-06-28 07:27:26,707 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.585e+02 7.263e+02 1.061e+03 1.748e+03 3.304e+03, threshold=2.122e+03, percent-clipped=12.0 2023-06-28 07:27:28,197 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.53 vs. limit=22.5 2023-06-28 07:27:32,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2004816.0, ans=0.125 2023-06-28 07:27:38,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2004876.0, ans=0.0 2023-06-28 07:27:39,641 INFO [train.py:996] (2/4) Epoch 11, batch 29200, loss[loss=0.2475, simple_loss=0.3083, pruned_loss=0.09336, over 21397.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2877, pruned_loss=0.06906, over 4269960.06 frames. ], batch size: 508, lr: 2.60e-03, grad_scale: 32.0 2023-06-28 07:27:45,570 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.20 vs. limit=6.0 2023-06-28 07:27:56,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2004876.0, ans=0.2 2023-06-28 07:28:54,451 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.76 vs. 
limit=22.5 2023-06-28 07:29:26,300 INFO [train.py:996] (2/4) Epoch 11, batch 29250, loss[loss=0.2121, simple_loss=0.3048, pruned_loss=0.05965, over 21803.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2865, pruned_loss=0.06718, over 4274276.78 frames. ], batch size: 333, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:29:54,918 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=22.5 2023-06-28 07:30:52,284 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.090e+02 7.920e+02 1.206e+03 1.772e+03 3.423e+03, threshold=2.413e+03, percent-clipped=14.0 2023-06-28 07:31:08,129 INFO [train.py:996] (2/4) Epoch 11, batch 29300, loss[loss=0.1903, simple_loss=0.2597, pruned_loss=0.06043, over 21586.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2882, pruned_loss=0.06606, over 4279876.27 frames. ], batch size: 298, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:31:12,630 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-28 07:31:18,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2005476.0, ans=0.125 2023-06-28 07:31:41,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2005596.0, ans=0.125 2023-06-28 07:31:58,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2005596.0, ans=0.125 2023-06-28 07:32:11,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2005656.0, ans=0.025 2023-06-28 07:32:22,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2005656.0, ans=0.125 2023-06-28 07:32:46,277 INFO [train.py:996] (2/4) Epoch 11, batch 29350, loss[loss=0.2004, simple_loss=0.2885, pruned_loss=0.05619, over 21510.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2837, pruned_loss=0.06527, over 4282791.72 frames. ], batch size: 195, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:33:02,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2005836.0, ans=0.125 2023-06-28 07:33:12,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2005836.0, ans=0.1 2023-06-28 07:33:25,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2005896.0, ans=0.0 2023-06-28 07:33:40,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2005896.0, ans=0.125 2023-06-28 07:33:41,095 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=12.0 2023-06-28 07:34:18,450 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.524e+02 6.284e+02 9.416e+02 1.465e+03 2.688e+03, threshold=1.883e+03, percent-clipped=1.0 2023-06-28 07:34:30,002 INFO [train.py:996] (2/4) Epoch 11, batch 29400, loss[loss=0.1904, simple_loss=0.2916, pruned_loss=0.04462, over 20803.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2842, pruned_loss=0.06354, over 4263658.58 frames. 
], batch size: 608, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:34:50,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2006136.0, ans=0.025 2023-06-28 07:35:02,124 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=15.0 2023-06-28 07:35:12,208 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=22.5 2023-06-28 07:35:38,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2006256.0, ans=0.125 2023-06-28 07:36:05,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2006316.0, ans=0.0 2023-06-28 07:36:13,429 INFO [train.py:996] (2/4) Epoch 11, batch 29450, loss[loss=0.2382, simple_loss=0.3151, pruned_loss=0.08068, over 21269.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2822, pruned_loss=0.06293, over 4261738.56 frames. ], batch size: 159, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:36:22,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2006376.0, ans=0.125 2023-06-28 07:37:03,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2006496.0, ans=0.0 2023-06-28 07:37:26,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2006556.0, ans=0.125 2023-06-28 07:37:26,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2006556.0, ans=0.125 2023-06-28 07:37:43,666 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.711e+02 7.454e+02 1.204e+03 1.829e+03 3.653e+03, threshold=2.407e+03, percent-clipped=22.0 2023-06-28 07:37:44,379 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 07:37:52,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2006616.0, ans=0.125 2023-06-28 07:37:52,432 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 07:37:54,986 INFO [train.py:996] (2/4) Epoch 11, batch 29500, loss[loss=0.2121, simple_loss=0.2901, pruned_loss=0.06711, over 21847.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2867, pruned_loss=0.06602, over 4268265.41 frames. 
], batch size: 414, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:37:57,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2006676.0, ans=0.125 2023-06-28 07:38:10,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2006736.0, ans=0.125 2023-06-28 07:38:42,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2006796.0, ans=0.0 2023-06-28 07:38:57,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2006796.0, ans=0.125 2023-06-28 07:39:01,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2006856.0, ans=0.0 2023-06-28 07:39:06,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2006856.0, ans=0.0 2023-06-28 07:39:15,926 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=12.0 2023-06-28 07:39:36,798 INFO [train.py:996] (2/4) Epoch 11, batch 29550, loss[loss=0.2133, simple_loss=0.2835, pruned_loss=0.07158, over 21342.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2866, pruned_loss=0.06738, over 4280821.72 frames. ], batch size: 176, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:40:20,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2007036.0, ans=0.0 2023-06-28 07:40:23,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2007096.0, ans=0.125 2023-06-28 07:41:08,577 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.953e+02 8.227e+02 1.182e+03 1.842e+03 6.634e+03, threshold=2.364e+03, percent-clipped=14.0 2023-06-28 07:41:19,886 INFO [train.py:996] (2/4) Epoch 11, batch 29600, loss[loss=0.3027, simple_loss=0.3968, pruned_loss=0.1043, over 21226.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2933, pruned_loss=0.0699, over 4284218.60 frames. ], batch size: 548, lr: 2.59e-03, grad_scale: 32.0 2023-06-28 07:42:08,861 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=12.0 2023-06-28 07:42:09,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2007396.0, ans=0.125 2023-06-28 07:42:14,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2007396.0, ans=0.125 2023-06-28 07:42:36,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2007456.0, ans=0.125 2023-06-28 07:42:57,473 INFO [train.py:996] (2/4) Epoch 11, batch 29650, loss[loss=0.1904, simple_loss=0.26, pruned_loss=0.06044, over 21764.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.29, pruned_loss=0.06679, over 4276315.17 frames. 
], batch size: 247, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:43:09,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2007576.0, ans=0.1 2023-06-28 07:43:54,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2007696.0, ans=0.125 2023-06-28 07:44:20,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2007816.0, ans=0.125 2023-06-28 07:44:25,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2007816.0, ans=0.95 2023-06-28 07:44:26,230 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.283e+02 7.090e+02 1.074e+03 1.668e+03 4.986e+03, threshold=2.147e+03, percent-clipped=16.0 2023-06-28 07:44:40,978 INFO [train.py:996] (2/4) Epoch 11, batch 29700, loss[loss=0.2034, simple_loss=0.312, pruned_loss=0.04742, over 19861.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2918, pruned_loss=0.06636, over 4281204.34 frames. ], batch size: 702, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:45:23,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2007936.0, ans=0.125 2023-06-28 07:45:55,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2008056.0, ans=0.0 2023-06-28 07:45:57,800 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=12.0 2023-06-28 07:46:08,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2008116.0, ans=0.0 2023-06-28 07:46:22,815 INFO [train.py:996] (2/4) Epoch 11, batch 29750, loss[loss=0.2721, simple_loss=0.3513, pruned_loss=0.09644, over 21576.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2963, pruned_loss=0.06629, over 4283733.23 frames. ], batch size: 507, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:46:25,438 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.61 vs. 
limit=22.5 2023-06-28 07:46:35,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=2008176.0, ans=15.0 2023-06-28 07:46:36,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2008176.0, ans=0.0 2023-06-28 07:46:40,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2008176.0, ans=0.0 2023-06-28 07:46:52,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2008236.0, ans=0.0 2023-06-28 07:47:33,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2008356.0, ans=0.125 2023-06-28 07:47:49,324 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.946e+02 7.116e+02 1.082e+03 1.518e+03 2.580e+03, threshold=2.164e+03, percent-clipped=5.0 2023-06-28 07:47:53,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2008416.0, ans=0.125 2023-06-28 07:48:07,947 INFO [train.py:996] (2/4) Epoch 11, batch 29800, loss[loss=0.2271, simple_loss=0.2972, pruned_loss=0.07853, over 21391.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2972, pruned_loss=0.06726, over 4286936.06 frames. ], batch size: 159, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:48:20,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2008476.0, ans=0.125 2023-06-28 07:48:43,924 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 07:48:55,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2008596.0, ans=0.1 2023-06-28 07:48:59,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2008596.0, ans=0.0 2023-06-28 07:48:59,611 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.56 vs. limit=12.0 2023-06-28 07:49:03,145 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. limit=6.0 2023-06-28 07:49:29,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2008716.0, ans=0.125 2023-06-28 07:49:33,739 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.66 vs. limit=10.0 2023-06-28 07:49:43,442 INFO [train.py:996] (2/4) Epoch 11, batch 29850, loss[loss=0.1927, simple_loss=0.2717, pruned_loss=0.05692, over 21869.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2933, pruned_loss=0.06526, over 4284089.96 frames. ], batch size: 371, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:50:54,370 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.02 vs. 
limit=12.0 2023-06-28 07:51:09,754 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.656e+02 6.362e+02 8.665e+02 1.424e+03 2.891e+03, threshold=1.733e+03, percent-clipped=5.0 2023-06-28 07:51:15,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2009016.0, ans=0.0 2023-06-28 07:51:29,360 INFO [train.py:996] (2/4) Epoch 11, batch 29900, loss[loss=0.2328, simple_loss=0.3094, pruned_loss=0.07808, over 21326.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2919, pruned_loss=0.06625, over 4289705.68 frames. ], batch size: 143, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:51:36,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2009076.0, ans=0.1 2023-06-28 07:51:55,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2009136.0, ans=0.0 2023-06-28 07:52:19,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2009196.0, ans=0.125 2023-06-28 07:52:24,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2009196.0, ans=0.125 2023-06-28 07:53:08,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=2009316.0, ans=22.5 2023-06-28 07:53:11,886 INFO [train.py:996] (2/4) Epoch 11, batch 29950, loss[loss=0.2493, simple_loss=0.3296, pruned_loss=0.08449, over 21252.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2964, pruned_loss=0.07038, over 4286315.49 frames. ], batch size: 143, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:53:33,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2009376.0, ans=0.2 2023-06-28 07:53:56,391 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2023-06-28 07:54:01,064 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 07:54:50,050 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.989e+02 7.621e+02 1.240e+03 1.706e+03 3.587e+03, threshold=2.479e+03, percent-clipped=22.0 2023-06-28 07:55:02,817 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-28 07:55:04,725 INFO [train.py:996] (2/4) Epoch 11, batch 30000, loss[loss=0.1893, simple_loss=0.2814, pruned_loss=0.04862, over 21734.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2986, pruned_loss=0.06998, over 4283429.65 frames. ], batch size: 247, lr: 2.59e-03, grad_scale: 32.0 2023-06-28 07:55:04,726 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-28 07:55:21,711 INFO [train.py:1028] (2/4) Epoch 11, validation: loss=0.2519, simple_loss=0.3444, pruned_loss=0.07975, over 1796401.00 frames. 
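The loss fields reported in these entries appear to combine the two sub-losses with a fixed weight on the simple loss: across this section, loss is approximately 0.5 * simple_loss + pruned_loss (for the validation entry just above, 0.5 * 0.3444 + 0.07975 is roughly 0.2519). The short sketch below only illustrates that observed relationship; the 0.5 weight is an assumption read off the logged values, and combined_loss is a hypothetical helper, not a function from train.py or icefall.

def combined_loss(simple_loss: float, pruned_loss: float, simple_scale: float = 0.5) -> float:
    # Reconstruct the scalar 'loss' value shown in the log lines,
    # assuming loss = simple_scale * simple_loss + pruned_loss.
    return simple_scale * simple_loss + pruned_loss

# Numbers taken from the validation entry logged above (epoch 11, batch 30000).
print(combined_loss(0.3444, 0.07975))  # ~0.2519, close to the logged validation loss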
2023-06-28 07:55:21,712 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23793MB 2023-06-28 07:55:22,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2009676.0, ans=0.0 2023-06-28 07:55:28,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2009676.0, ans=0.125 2023-06-28 07:57:10,907 INFO [train.py:996] (2/4) Epoch 11, batch 30050, loss[loss=0.2392, simple_loss=0.3474, pruned_loss=0.06555, over 21750.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.3005, pruned_loss=0.06723, over 4273127.82 frames. ], batch size: 332, lr: 2.59e-03, grad_scale: 32.0 2023-06-28 07:58:44,578 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.592e+02 6.904e+02 1.436e+03 2.249e+03 4.425e+03, threshold=2.873e+03, percent-clipped=20.0 2023-06-28 07:58:52,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2010276.0, ans=0.1 2023-06-28 07:58:53,201 INFO [train.py:996] (2/4) Epoch 11, batch 30100, loss[loss=0.1912, simple_loss=0.2503, pruned_loss=0.06605, over 21224.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2986, pruned_loss=0.06697, over 4257862.77 frames. ], batch size: 176, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:59:05,921 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.15 vs. limit=22.5 2023-06-28 07:59:34,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2010336.0, ans=0.125 2023-06-28 07:59:43,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2010396.0, ans=0.125 2023-06-28 08:00:03,496 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=15.0 2023-06-28 08:00:05,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2010456.0, ans=0.07 2023-06-28 08:00:36,696 INFO [train.py:996] (2/4) Epoch 11, batch 30150, loss[loss=0.2452, simple_loss=0.3269, pruned_loss=0.08174, over 21827.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2956, pruned_loss=0.06861, over 4246996.80 frames. ], batch size: 124, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 08:01:29,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2010696.0, ans=0.125 2023-06-28 08:01:53,910 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=22.5 2023-06-28 08:02:02,942 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=12.0 2023-06-28 08:02:18,806 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.690e+02 6.702e+02 9.063e+02 1.523e+03 3.175e+03, threshold=1.813e+03, percent-clipped=2.0 2023-06-28 08:02:36,294 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.94 vs. 
limit=6.0 2023-06-28 08:02:36,691 INFO [train.py:996] (2/4) Epoch 11, batch 30200, loss[loss=0.2102, simple_loss=0.2894, pruned_loss=0.06551, over 21762.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2979, pruned_loss=0.06737, over 4257042.53 frames. ], batch size: 124, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 08:02:43,179 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.71 vs. limit=15.0 2023-06-28 08:02:43,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2010876.0, ans=0.125 2023-06-28 08:03:01,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2010936.0, ans=0.2 2023-06-28 08:03:01,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2010936.0, ans=0.07 2023-06-28 08:03:15,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2010996.0, ans=0.125 2023-06-28 08:04:14,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2011116.0, ans=0.125 2023-06-28 08:04:21,945 INFO [train.py:996] (2/4) Epoch 11, batch 30250, loss[loss=0.2367, simple_loss=0.346, pruned_loss=0.06376, over 21765.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3058, pruned_loss=0.06968, over 4264953.21 frames. ], batch size: 282, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 08:05:26,717 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.38 vs. limit=15.0 2023-06-28 08:05:41,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2011416.0, ans=0.2 2023-06-28 08:05:53,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2011416.0, ans=0.2 2023-06-28 08:05:55,953 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.393e+02 7.352e+02 1.154e+03 1.714e+03 3.720e+03, threshold=2.308e+03, percent-clipped=21.0 2023-06-28 08:06:04,346 INFO [train.py:996] (2/4) Epoch 11, batch 30300, loss[loss=0.1798, simple_loss=0.2489, pruned_loss=0.05532, over 21520.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.3034, pruned_loss=0.06943, over 4264616.65 frames. ], batch size: 231, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 08:06:17,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2011476.0, ans=0.1 2023-06-28 08:07:26,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2011656.0, ans=0.125 2023-06-28 08:07:50,271 INFO [train.py:996] (2/4) Epoch 11, batch 30350, loss[loss=0.2033, simple_loss=0.2602, pruned_loss=0.07316, over 20217.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3052, pruned_loss=0.07131, over 4262525.26 frames. 
], batch size: 703, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 08:08:29,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2011896.0, ans=0.2 2023-06-28 08:08:35,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2011956.0, ans=0.09899494936611666 2023-06-28 08:08:41,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2011956.0, ans=0.0 2023-06-28 08:08:42,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2011956.0, ans=0.0 2023-06-28 08:08:57,088 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.778e+02 8.957e+02 1.374e+03 2.295e+03 4.777e+03, threshold=2.749e+03, percent-clipped=24.0 2023-06-28 08:09:11,745 INFO [train.py:996] (2/4) Epoch 11, batch 30400, loss[loss=0.1914, simple_loss=0.2451, pruned_loss=0.06888, over 20311.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2996, pruned_loss=0.06993, over 4253267.74 frames. ], batch size: 703, lr: 2.59e-03, grad_scale: 32.0 2023-06-28 08:10:13,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2012256.0, ans=0.125 2023-06-28 08:10:34,681 INFO [train.py:996] (2/4) Epoch 11, batch 30450, loss[loss=0.2603, simple_loss=0.3745, pruned_loss=0.07306, over 19943.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2995, pruned_loss=0.0695, over 4196024.45 frames. ], batch size: 702, lr: 2.59e-03, grad_scale: 8.0 2023-06-28 08:10:50,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2012376.0, ans=0.125 2023-06-28 08:11:09,104 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 08:11:27,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2012556.0, ans=0.125 2023-06-28 08:13:53,298 INFO [train.py:996] (2/4) Epoch 12, batch 0, loss[loss=0.2001, simple_loss=0.2616, pruned_loss=0.06935, over 21395.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2616, pruned_loss=0.06935, over 21395.00 frames. ], batch size: 212, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:13:53,299 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-28 08:14:03,555 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.5.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.3954, 2.3010, 4.3265, 4.0699], device='cuda:2') 2023-06-28 08:14:09,660 INFO [train.py:1028] (2/4) Epoch 12, validation: loss=0.2477, simple_loss=0.3485, pruned_loss=0.0734, over 1796401.00 frames. 2023-06-28 08:14:09,660 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23793MB 2023-06-28 08:14:12,903 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.161e+02 1.803e+03 3.374e+03 5.381e+03 1.358e+04, threshold=6.748e+03, percent-clipped=56.0 2023-06-28 08:14:50,979 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.69 vs. 
limit=22.5 2023-06-28 08:15:19,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2012826.0, ans=10.0 2023-06-28 08:15:33,180 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.69 vs. limit=15.0 2023-06-28 08:15:54,162 INFO [train.py:996] (2/4) Epoch 12, batch 50, loss[loss=0.1847, simple_loss=0.2624, pruned_loss=0.05346, over 21800.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.299, pruned_loss=0.06899, over 968875.76 frames. ], batch size: 124, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:16:05,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2012946.0, ans=0.125 2023-06-28 08:16:33,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2013006.0, ans=0.1 2023-06-28 08:17:00,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2013066.0, ans=0.0 2023-06-28 08:17:04,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2013126.0, ans=0.5 2023-06-28 08:17:09,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2013126.0, ans=0.1 2023-06-28 08:17:19,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2013186.0, ans=0.1 2023-06-28 08:17:37,278 INFO [train.py:996] (2/4) Epoch 12, batch 100, loss[loss=0.2383, simple_loss=0.328, pruned_loss=0.07429, over 21770.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3154, pruned_loss=0.06999, over 1700423.32 frames. ], batch size: 247, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:17:40,464 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.118e+02 6.672e+02 9.899e+02 1.706e+03 3.699e+03, threshold=1.980e+03, percent-clipped=0.0 2023-06-28 08:17:52,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2013246.0, ans=0.0 2023-06-28 08:17:55,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2013246.0, ans=0.125 2023-06-28 08:18:40,860 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.64 vs. limit=15.0 2023-06-28 08:18:42,272 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.09 vs. 
limit=10.0 2023-06-28 08:18:43,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2013366.0, ans=0.0 2023-06-28 08:18:51,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2013426.0, ans=0.125 2023-06-28 08:18:59,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=2013426.0, ans=0.5 2023-06-28 08:19:10,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2013486.0, ans=0.125 2023-06-28 08:19:18,619 INFO [train.py:996] (2/4) Epoch 12, batch 150, loss[loss=0.2568, simple_loss=0.3412, pruned_loss=0.08616, over 21738.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3129, pruned_loss=0.06878, over 2273758.68 frames. ], batch size: 441, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:19:23,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2013546.0, ans=0.1 2023-06-28 08:19:28,511 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.15 vs. limit=22.5 2023-06-28 08:19:30,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2013546.0, ans=0.025 2023-06-28 08:20:40,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2013786.0, ans=0.125 2023-06-28 08:20:57,644 INFO [train.py:996] (2/4) Epoch 12, batch 200, loss[loss=0.2321, simple_loss=0.3163, pruned_loss=0.07392, over 21455.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3124, pruned_loss=0.06901, over 2721876.62 frames. ], batch size: 471, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:21:00,959 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.147e+02 7.904e+02 1.199e+03 1.656e+03 3.803e+03, threshold=2.398e+03, percent-clipped=21.0 2023-06-28 08:22:00,120 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.08 vs. limit=15.0 2023-06-28 08:22:13,578 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=12.0 2023-06-28 08:22:42,038 INFO [train.py:996] (2/4) Epoch 12, batch 250, loss[loss=0.2406, simple_loss=0.306, pruned_loss=0.0876, over 21407.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3085, pruned_loss=0.06866, over 3060210.88 frames. ], batch size: 548, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:23:08,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2014206.0, ans=0.125 2023-06-28 08:23:24,093 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.50 vs. 
limit=15.0 2023-06-28 08:23:52,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2014266.0, ans=0.0 2023-06-28 08:24:13,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2014386.0, ans=0.125 2023-06-28 08:24:32,097 INFO [train.py:996] (2/4) Epoch 12, batch 300, loss[loss=0.2272, simple_loss=0.3125, pruned_loss=0.0709, over 21445.00 frames. ], tot_loss[loss=0.22, simple_loss=0.3034, pruned_loss=0.06833, over 3323351.60 frames. ], batch size: 548, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:24:35,392 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.745e+02 6.973e+02 9.077e+02 1.413e+03 3.093e+03, threshold=1.815e+03, percent-clipped=6.0 2023-06-28 08:24:57,241 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 08:25:50,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2014626.0, ans=0.125 2023-06-28 08:25:55,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2014626.0, ans=0.125 2023-06-28 08:26:20,885 INFO [train.py:996] (2/4) Epoch 12, batch 350, loss[loss=0.2554, simple_loss=0.3427, pruned_loss=0.08406, over 21459.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2967, pruned_loss=0.06662, over 3522890.81 frames. ], batch size: 471, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:27:38,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2014926.0, ans=0.125 2023-06-28 08:27:40,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2014926.0, ans=0.125 2023-06-28 08:28:05,213 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=12.0 2023-06-28 08:28:07,210 INFO [train.py:996] (2/4) Epoch 12, batch 400, loss[loss=0.1814, simple_loss=0.269, pruned_loss=0.04691, over 21700.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.289, pruned_loss=0.06577, over 3679483.64 frames. ], batch size: 333, lr: 2.47e-03, grad_scale: 32.0 2023-06-28 08:28:10,667 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.793e+02 7.672e+02 1.106e+03 1.472e+03 3.614e+03, threshold=2.212e+03, percent-clipped=11.0 2023-06-28 08:28:13,049 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 08:28:36,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2015106.0, ans=0.0 2023-06-28 08:29:53,468 INFO [train.py:996] (2/4) Epoch 12, batch 450, loss[loss=0.1793, simple_loss=0.2501, pruned_loss=0.05422, over 21786.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2839, pruned_loss=0.06408, over 3813731.80 frames. ], batch size: 317, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:30:02,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2015346.0, ans=0.125 2023-06-28 08:31:37,510 INFO [train.py:996] (2/4) Epoch 12, batch 500, loss[loss=0.1717, simple_loss=0.2662, pruned_loss=0.03856, over 21613.00 frames. 
], tot_loss[loss=0.2042, simple_loss=0.2844, pruned_loss=0.06201, over 3914298.76 frames. ], batch size: 263, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:31:42,488 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.576e+02 9.650e+02 1.378e+03 2.425e+03 6.087e+03, threshold=2.755e+03, percent-clipped=29.0 2023-06-28 08:31:47,109 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=22.5 2023-06-28 08:32:17,623 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.66 vs. limit=10.0 2023-06-28 08:32:46,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2015826.0, ans=0.0 2023-06-28 08:32:53,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2015826.0, ans=0.0 2023-06-28 08:33:22,129 INFO [train.py:996] (2/4) Epoch 12, batch 550, loss[loss=0.2248, simple_loss=0.301, pruned_loss=0.07424, over 21796.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2887, pruned_loss=0.062, over 3997385.81 frames. ], batch size: 112, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:34:45,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2016186.0, ans=0.125 2023-06-28 08:35:00,970 INFO [train.py:996] (2/4) Epoch 12, batch 600, loss[loss=0.1967, simple_loss=0.267, pruned_loss=0.06317, over 21870.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2952, pruned_loss=0.06326, over 4067434.17 frames. ], batch size: 107, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:35:05,855 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.203e+02 8.466e+02 1.434e+03 2.194e+03 5.258e+03, threshold=2.867e+03, percent-clipped=12.0 2023-06-28 08:36:44,615 INFO [train.py:996] (2/4) Epoch 12, batch 650, loss[loss=0.2204, simple_loss=0.3025, pruned_loss=0.06912, over 21432.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2943, pruned_loss=0.06322, over 4118023.24 frames. ], batch size: 211, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:37:56,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2016726.0, ans=0.125 2023-06-28 08:38:23,206 INFO [train.py:996] (2/4) Epoch 12, batch 700, loss[loss=0.2208, simple_loss=0.3027, pruned_loss=0.06939, over 21831.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.293, pruned_loss=0.06316, over 4159193.72 frames. ], batch size: 107, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 08:38:34,695 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.083e+02 8.629e+02 1.370e+03 1.985e+03 4.368e+03, threshold=2.739e+03, percent-clipped=8.0 2023-06-28 08:38:52,580 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.95 vs. limit=22.5 2023-06-28 08:39:04,728 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-06-28 08:40:06,464 INFO [train.py:996] (2/4) Epoch 12, batch 750, loss[loss=0.1848, simple_loss=0.2644, pruned_loss=0.05264, over 21896.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2953, pruned_loss=0.06507, over 4195128.18 frames. 
], batch size: 316, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 08:40:10,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2017146.0, ans=0.0 2023-06-28 08:40:48,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2017266.0, ans=0.125 2023-06-28 08:40:56,982 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 08:41:25,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2017326.0, ans=0.125 2023-06-28 08:41:28,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2017326.0, ans=0.125 2023-06-28 08:41:50,315 INFO [train.py:996] (2/4) Epoch 12, batch 800, loss[loss=0.1866, simple_loss=0.261, pruned_loss=0.05611, over 21802.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2925, pruned_loss=0.06603, over 4211635.05 frames. ], batch size: 247, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:41:53,190 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.28 vs. limit=10.0 2023-06-28 08:42:01,926 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.954e+02 9.081e+02 1.260e+03 2.091e+03 4.459e+03, threshold=2.521e+03, percent-clipped=14.0 2023-06-28 08:42:19,628 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.99 vs. limit=15.0 2023-06-28 08:42:38,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=2017566.0, ans=0.5 2023-06-28 08:42:38,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2017566.0, ans=0.0 2023-06-28 08:42:50,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2017566.0, ans=0.125 2023-06-28 08:43:33,341 INFO [train.py:996] (2/4) Epoch 12, batch 850, loss[loss=0.2088, simple_loss=0.2798, pruned_loss=0.06892, over 21286.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2895, pruned_loss=0.06645, over 4234419.68 frames. ], batch size: 176, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:43:45,114 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.60 vs. limit=10.0 2023-06-28 08:45:09,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=2017986.0, ans=0.2 2023-06-28 08:45:21,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2017986.0, ans=0.0 2023-06-28 08:45:24,335 INFO [train.py:996] (2/4) Epoch 12, batch 900, loss[loss=0.2114, simple_loss=0.2744, pruned_loss=0.07417, over 21629.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2877, pruned_loss=0.06624, over 4250011.45 frames. ], batch size: 391, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:45:35,211 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.21 vs. 
limit=15.0 2023-06-28 08:45:35,781 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.139e+02 7.978e+02 1.292e+03 1.942e+03 4.093e+03, threshold=2.584e+03, percent-clipped=13.0 2023-06-28 08:46:48,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2018286.0, ans=0.0 2023-06-28 08:46:50,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2018286.0, ans=0.0 2023-06-28 08:47:14,380 INFO [train.py:996] (2/4) Epoch 12, batch 950, loss[loss=0.2315, simple_loss=0.3104, pruned_loss=0.07629, over 21872.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2864, pruned_loss=0.06562, over 4256252.62 frames. ], batch size: 371, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:47:20,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2018346.0, ans=0.04949747468305833 2023-06-28 08:47:23,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2018346.0, ans=0.2 2023-06-28 08:47:52,761 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 08:48:40,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2018586.0, ans=0.0 2023-06-28 08:48:42,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2018586.0, ans=0.125 2023-06-28 08:48:56,773 INFO [train.py:996] (2/4) Epoch 12, batch 1000, loss[loss=0.1635, simple_loss=0.2506, pruned_loss=0.03815, over 21595.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2864, pruned_loss=0.06516, over 4263012.86 frames. ], batch size: 230, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:49:03,707 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.940e+02 7.062e+02 8.970e+02 1.402e+03 3.868e+03, threshold=1.794e+03, percent-clipped=7.0 2023-06-28 08:49:18,004 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 08:49:25,701 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-06-28 08:49:36,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2018706.0, ans=0.2 2023-06-28 08:49:38,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2018706.0, ans=0.04949747468305833 2023-06-28 08:49:50,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2018766.0, ans=0.1 2023-06-28 08:50:15,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2018826.0, ans=0.125 2023-06-28 08:50:42,114 INFO [train.py:996] (2/4) Epoch 12, batch 1050, loss[loss=0.2716, simple_loss=0.3388, pruned_loss=0.1022, over 21579.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2868, pruned_loss=0.06501, over 4265568.10 frames. 
], batch size: 414, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:50:59,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2018946.0, ans=0.125 2023-06-28 08:51:04,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2019006.0, ans=0.1 2023-06-28 08:51:18,233 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=22.5 2023-06-28 08:51:31,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2019066.0, ans=0.035 2023-06-28 08:52:31,737 INFO [train.py:996] (2/4) Epoch 12, batch 1100, loss[loss=0.186, simple_loss=0.2832, pruned_loss=0.0444, over 21773.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2866, pruned_loss=0.06386, over 4269348.99 frames. ], batch size: 351, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:52:36,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2019246.0, ans=0.0 2023-06-28 08:52:36,713 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.25 vs. limit=15.0 2023-06-28 08:52:39,018 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.826e+02 7.483e+02 1.102e+03 1.696e+03 3.574e+03, threshold=2.203e+03, percent-clipped=22.0 2023-06-28 08:53:14,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2019366.0, ans=0.0 2023-06-28 08:54:14,964 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.49 vs. limit=15.0 2023-06-28 08:54:17,193 INFO [train.py:996] (2/4) Epoch 12, batch 1150, loss[loss=0.2063, simple_loss=0.2841, pruned_loss=0.06421, over 21517.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2865, pruned_loss=0.06352, over 4274419.49 frames. ], batch size: 548, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:54:34,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2019546.0, ans=0.0 2023-06-28 08:54:55,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2019666.0, ans=0.125 2023-06-28 08:55:50,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2019786.0, ans=0.125 2023-06-28 08:55:55,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2019786.0, ans=0.1 2023-06-28 08:56:08,751 INFO [train.py:996] (2/4) Epoch 12, batch 1200, loss[loss=0.2234, simple_loss=0.3129, pruned_loss=0.06697, over 21741.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2883, pruned_loss=0.06416, over 4276921.38 frames. 
], batch size: 351, lr: 2.47e-03, grad_scale: 32.0 2023-06-28 08:56:15,505 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.003e+02 8.397e+02 1.494e+03 2.117e+03 4.524e+03, threshold=2.987e+03, percent-clipped=23.0 2023-06-28 08:56:36,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2019906.0, ans=0.125 2023-06-28 08:56:37,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2019906.0, ans=0.0 2023-06-28 08:57:24,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2020026.0, ans=0.0 2023-06-28 08:57:46,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2020086.0, ans=0.125 2023-06-28 08:57:49,456 INFO [train.py:996] (2/4) Epoch 12, batch 1250, loss[loss=0.2022, simple_loss=0.3065, pruned_loss=0.04893, over 21750.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2901, pruned_loss=0.06539, over 4281726.68 frames. ], batch size: 351, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:58:23,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2020206.0, ans=0.125 2023-06-28 08:59:05,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2020326.0, ans=0.0 2023-06-28 08:59:37,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2020386.0, ans=0.125 2023-06-28 08:59:40,420 INFO [train.py:996] (2/4) Epoch 12, batch 1300, loss[loss=0.2468, simple_loss=0.3414, pruned_loss=0.07608, over 21889.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2919, pruned_loss=0.0664, over 4275787.60 frames. ], batch size: 372, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:59:48,735 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.742e+02 7.744e+02 1.078e+03 1.630e+03 3.241e+03, threshold=2.156e+03, percent-clipped=1.0 2023-06-28 08:59:54,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2020446.0, ans=0.125 2023-06-28 08:59:59,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2020506.0, ans=10.0 2023-06-28 09:01:14,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2020686.0, ans=0.0 2023-06-28 09:01:25,438 INFO [train.py:996] (2/4) Epoch 12, batch 1350, loss[loss=0.1902, simple_loss=0.257, pruned_loss=0.06172, over 21236.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2909, pruned_loss=0.0658, over 4278467.42 frames. 
], batch size: 608, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:01:48,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2020806.0, ans=0.125 2023-06-28 09:02:21,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2020926.0, ans=0.0 2023-06-28 09:02:40,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2020926.0, ans=0.0 2023-06-28 09:02:45,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2020986.0, ans=0.04949747468305833 2023-06-28 09:02:47,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2020986.0, ans=0.0 2023-06-28 09:02:50,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2020986.0, ans=0.125 2023-06-28 09:03:04,986 INFO [train.py:996] (2/4) Epoch 12, batch 1400, loss[loss=0.1943, simple_loss=0.2692, pruned_loss=0.05975, over 21705.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2881, pruned_loss=0.06606, over 4277281.53 frames. ], batch size: 230, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:03:13,318 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.716e+02 8.874e+02 1.255e+03 1.971e+03 3.857e+03, threshold=2.510e+03, percent-clipped=18.0 2023-06-28 09:04:44,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2021286.0, ans=0.125 2023-06-28 09:04:50,276 INFO [train.py:996] (2/4) Epoch 12, batch 1450, loss[loss=0.2209, simple_loss=0.3021, pruned_loss=0.06981, over 21597.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2893, pruned_loss=0.06713, over 4277328.40 frames. ], batch size: 112, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:05:09,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2021406.0, ans=0.0 2023-06-28 09:05:11,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2021406.0, ans=0.0 2023-06-28 09:05:25,580 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-28 09:06:12,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2021526.0, ans=0.0 2023-06-28 09:06:26,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2021586.0, ans=0.125 2023-06-28 09:06:36,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=2021646.0, ans=0.1 2023-06-28 09:06:37,302 INFO [train.py:996] (2/4) Epoch 12, batch 1500, loss[loss=0.2401, simple_loss=0.3209, pruned_loss=0.07966, over 21523.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.291, pruned_loss=0.06828, over 4283365.91 frames. 
], batch size: 414, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 09:06:37,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2021646.0, ans=0.125 2023-06-28 09:06:47,785 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.815e+02 8.356e+02 1.274e+03 1.855e+03 4.343e+03, threshold=2.548e+03, percent-clipped=12.0 2023-06-28 09:08:24,459 INFO [train.py:996] (2/4) Epoch 12, batch 1550, loss[loss=0.1666, simple_loss=0.2682, pruned_loss=0.03252, over 19909.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2912, pruned_loss=0.06751, over 4286102.55 frames. ], batch size: 702, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 09:08:25,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2021946.0, ans=0.0 2023-06-28 09:08:33,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2021946.0, ans=0.0 2023-06-28 09:08:54,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2022006.0, ans=0.125 2023-06-28 09:09:13,269 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 09:09:40,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2022126.0, ans=0.125 2023-06-28 09:09:53,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2022186.0, ans=0.0 2023-06-28 09:10:00,377 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 09:10:07,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2022186.0, ans=0.125 2023-06-28 09:10:07,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2022186.0, ans=0.95 2023-06-28 09:10:09,937 INFO [train.py:996] (2/4) Epoch 12, batch 1600, loss[loss=0.1974, simple_loss=0.2875, pruned_loss=0.05359, over 21826.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2921, pruned_loss=0.06754, over 4280336.64 frames. ], batch size: 282, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:10:20,073 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.882e+02 7.910e+02 1.210e+03 1.920e+03 3.790e+03, threshold=2.419e+03, percent-clipped=9.0 2023-06-28 09:10:22,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2022246.0, ans=0.0 2023-06-28 09:10:54,652 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.54 vs. limit=15.0 2023-06-28 09:11:36,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2022426.0, ans=0.125 2023-06-28 09:11:38,546 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-28 09:11:58,024 INFO [train.py:996] (2/4) Epoch 12, batch 1650, loss[loss=0.1802, simple_loss=0.2742, pruned_loss=0.04313, over 21661.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2892, pruned_loss=0.06656, over 4279264.90 frames. 
], batch size: 247, lr: 2.47e-03, grad_scale: 16.0
2023-06-28 09:12:21,627 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.62 vs. limit=15.0
2023-06-28 09:13:07,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2022666.0, ans=0.0
2023-06-28 09:13:19,419 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
2023-06-28 09:13:29,102 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.18 vs. limit=22.5
2023-06-28 09:13:45,608 INFO [train.py:996] (2/4) Epoch 12, batch 1700, loss[loss=0.2073, simple_loss=0.2839, pruned_loss=0.0653, over 21589.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2927, pruned_loss=0.06833, over 4279752.51 frames. ], batch size: 263, lr: 2.47e-03, grad_scale: 16.0
2023-06-28 09:13:55,788 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.138e+02 6.859e+02 1.024e+03 1.407e+03 3.205e+03, threshold=2.048e+03, percent-clipped=5.0
2023-06-28 09:14:48,204 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=15.0
2023-06-28 09:15:06,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2023026.0, ans=0.125
2023-06-28 09:15:32,588 INFO [train.py:996] (2/4) Epoch 12, batch 1750, loss[loss=0.1557, simple_loss=0.2367, pruned_loss=0.03737, over 21375.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2914, pruned_loss=0.06656, over 4281670.52 frames. ], batch size: 194, lr: 2.47e-03, grad_scale: 16.0
2023-06-28 09:15:58,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2023206.0, ans=0.0
2023-06-28 09:17:10,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2023386.0, ans=0.1
2023-06-28 09:17:25,807 INFO [train.py:996] (2/4) Epoch 12, batch 1800, loss[loss=0.2086, simple_loss=0.303, pruned_loss=0.05706, over 21647.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2926, pruned_loss=0.06458, over 4283320.17 frames. ], batch size: 247, lr: 2.47e-03, grad_scale: 16.0
2023-06-28 09:17:28,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2023446.0, ans=0.125
2023-06-28 09:17:46,591 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.284e+02 7.829e+02 1.190e+03 1.910e+03 4.483e+03, threshold=2.381e+03, percent-clipped=19.0
2023-06-28 09:19:11,624 INFO [train.py:996] (2/4) Epoch 12, batch 1850, loss[loss=0.2116, simple_loss=0.2951, pruned_loss=0.06406, over 21637.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2938, pruned_loss=0.06313, over 4279444.26 frames. ], batch size: 263, lr: 2.47e-03, grad_scale: 16.0
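The Clipping_scale lines from optim.py above summarize the distribution of recent gradient norms: the five numbers are quartile statistics (minimum, 25th percentile, median, 75th percentile, maximum), and the reported threshold equals Clipping_scale times the median (for example 2.0 * 1.024e+03 = 2.048e+03 in the 09:13:55 entry), with percent-clipped presumably the share of recent batches whose gradients were rescaled. The following is a minimal sketch of how such diagnostics could be produced; the window size, the percent-clipped bookkeeping and the class and method names are illustrative assumptions, not the actual optim.py implementation.

# Sketch (assumed, simplified): quartile-based gradient-norm clipping diagnostics.
from collections import deque

import torch

class GradNormClipper:
    def __init__(self, clipping_scale=2.0, window=128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)  # recent per-batch gradient norms
        self.num_clipped = 0
        self.num_batches = 0

    def clip_(self, parameters):
        params = [p for p in parameters if p.grad is not None]
        norm = torch.norm(torch.stack([p.grad.detach().norm() for p in params])).item()
        self.norms.append(norm)
        self.num_batches += 1
        ranked = sorted(self.norms)
        # min, 25%, median, 75%, max over the window of recent norms
        quartiles = [ranked[int(f * (len(ranked) - 1))] for f in (0.0, 0.25, 0.5, 0.75, 1.0)]
        threshold = self.clipping_scale * quartiles[2]
        if norm > threshold:
            self.num_clipped += 1
            for p in params:
                p.grad.mul_(threshold / norm)  # rescale so the total grad norm equals the threshold
        print(
            "Clipping_scale=%.1f, grad-norm quartiles %s, threshold=%.3e, percent-clipped=%.1f"
            % (
                self.clipping_scale,
                " ".join("%.3e" % q for q in quartiles),
                threshold,
                100.0 * self.num_clipped / self.num_batches,
            )
        )
        return norm

In such a scheme the clipping decision is made against a statistic of recent history rather than a fixed constant, which is consistent with the reported thresholds drifting as training progresses.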
2023-06-28 09:19:36,810 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.81 vs. limit=15.0
2023-06-28 09:19:56,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2023866.0, ans=0.125
2023-06-28 09:20:02,636 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.22 vs. limit=15.0
2023-06-28 09:20:10,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2023866.0, ans=0.125
2023-06-28 09:20:15,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2023926.0, ans=0.1
2023-06-28 09:20:15,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2023926.0, ans=0.0
2023-06-28 09:20:59,932 INFO [train.py:996] (2/4) Epoch 12, batch 1900, loss[loss=0.2786, simple_loss=0.3358, pruned_loss=0.1107, over 21683.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2941, pruned_loss=0.06314, over 4280885.55 frames. ], batch size: 507, lr: 2.47e-03, grad_scale: 8.0
2023-06-28 09:21:22,278 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.886e+02 8.389e+02 1.357e+03 2.180e+03 3.591e+03, threshold=2.714e+03, percent-clipped=20.0
2023-06-28 09:21:38,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2024106.0, ans=0.125
2023-06-28 09:22:54,025 INFO [train.py:996] (2/4) Epoch 12, batch 1950, loss[loss=0.2038, simple_loss=0.2713, pruned_loss=0.06816, over 21833.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.29, pruned_loss=0.06311, over 4279081.85 frames. ], batch size: 118, lr: 2.47e-03, grad_scale: 8.0
2023-06-28 09:23:10,147 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-28 09:24:40,573 INFO [train.py:996] (2/4) Epoch 12, batch 2000, loss[loss=0.249, simple_loss=0.3418, pruned_loss=0.07808, over 20029.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2839, pruned_loss=0.06128, over 4263546.01 frames. ], batch size: 702, lr: 2.47e-03, grad_scale: 16.0
2023-06-28 09:24:52,587 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.543e+02 8.090e+02 1.262e+03 2.210e+03 4.405e+03, threshold=2.524e+03, percent-clipped=15.0
2023-06-28 09:25:55,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=2024826.0, ans=0.2
2023-06-28 09:26:10,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2024886.0, ans=0.0
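The ScheduledFloat entries from scaling.py above record hyperparameters such as dropout probabilities, skip rates and balancer limits whose value depends on the global batch count; ans is the value in effect at the reported batch_count. Below is a minimal sketch of such a schedule, assuming piecewise-linear interpolation between (batch_count, value) breakpoints; the class layout, the logging call and the example breakpoints and name are illustrative assumptions rather than the scaling.py implementation.

# Sketch (assumed, simplified): a float hyperparameter scheduled on batch count.
import logging

class ScheduledFloat:
    def __init__(self, name, *points):
        # points: (batch_count, value) breakpoints, e.g. (0, 0.3), (20000, 0.1)
        self.name = name
        self.points = sorted(points)
        self.batch_count = 0.0

    def step(self, batch_count):
        self.batch_count = float(batch_count)

    def value(self):
        pts = self.points
        if self.batch_count <= pts[0][0]:
            ans = pts[0][1]
        elif self.batch_count >= pts[-1][0]:
            ans = pts[-1][1]
        else:
            ans = pts[-1][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= self.batch_count <= x1:
                    frac = (self.batch_count - x0) / (x1 - x0)
                    ans = y0 + frac * (y1 - y0)
                    break
        logging.info(
            "ScheduledFloat: name=%s, batch_count=%s, ans=%s",
            self.name, self.batch_count, ans,
        )
        return ans

logging.basicConfig(level=logging.INFO)
# Example with hypothetical breakpoints: a dropout probability decaying from 0.3 to 0.1
# over the first 20k batches, then held constant.
dropout_p = ScheduledFloat("encoder.layers.0.dropout_p", (0.0, 0.3), (20000.0, 0.1))
dropout_p.step(2023926.0)
dropout_p.value()  # well past the last breakpoint, so the schedule has settled at 0.1

Values such as ans=0.0 or ans=0.125 that repeat over many batches in the log simply mean the corresponding schedule has reached a flat region.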
2023-06-28 09:26:25,014 INFO [train.py:996] (2/4) Epoch 12, batch 2050, loss[loss=0.2295, simple_loss=0.3251, pruned_loss=0.0669, over 21555.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2852, pruned_loss=0.0609, over 4267849.72 frames. ], batch size: 473, lr: 2.47e-03, grad_scale: 8.0
2023-06-28 09:27:13,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2025066.0, ans=0.0
2023-06-28 09:27:14,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2025066.0, ans=0.125
2023-06-28 09:27:14,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2025066.0, ans=0.125
2023-06-28 09:27:53,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2025186.0, ans=0.125
2023-06-28 09:28:07,569 INFO [train.py:996] (2/4) Epoch 12, batch 2100, loss[loss=0.2046, simple_loss=0.2826, pruned_loss=0.06325, over 21252.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2892, pruned_loss=0.0626, over 4275656.19 frames. ], batch size: 549, lr: 2.47e-03, grad_scale: 8.0
2023-06-28 09:28:19,265 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.11 vs. limit=15.0
2023-06-28 09:28:21,406 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.339e+02 9.933e+02 1.500e+03 2.145e+03 4.437e+03, threshold=3.000e+03, percent-clipped=17.0
2023-06-28 09:28:47,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2025366.0, ans=0.0
2023-06-28 09:28:48,684 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-28 09:29:52,497 INFO [train.py:996] (2/4) Epoch 12, batch 2150, loss[loss=0.1753, simple_loss=0.2431, pruned_loss=0.05374, over 21594.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2896, pruned_loss=0.06412, over 4272187.05 frames. ], batch size: 263, lr: 2.47e-03, grad_scale: 8.0
2023-06-28 09:29:53,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2025546.0, ans=0.0
2023-06-28 09:29:59,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2025546.0, ans=0.0
2023-06-28 09:30:04,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2025546.0, ans=0.125
2023-06-28 09:30:47,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2025726.0, ans=0.1
2023-06-28 09:30:49,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2025726.0, ans=0.125
2023-06-28 09:31:37,757 INFO [train.py:996] (2/4) Epoch 12, batch 2200, loss[loss=0.2157, simple_loss=0.2725, pruned_loss=0.07943, over 21401.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2887, pruned_loss=0.06384, over 4267232.36 frames.
], batch size: 473, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 09:31:51,397 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.594e+02 7.144e+02 1.049e+03 1.524e+03 3.402e+03, threshold=2.098e+03, percent-clipped=4.0 2023-06-28 09:32:48,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2026026.0, ans=0.125 2023-06-28 09:32:52,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2026026.0, ans=0.2 2023-06-28 09:33:21,820 INFO [train.py:996] (2/4) Epoch 12, batch 2250, loss[loss=0.249, simple_loss=0.3385, pruned_loss=0.07972, over 21479.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2882, pruned_loss=0.06313, over 4269021.57 frames. ], batch size: 471, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 09:33:42,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2026206.0, ans=0.2 2023-06-28 09:34:27,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=2026326.0, ans=0.05 2023-06-28 09:34:40,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2026326.0, ans=0.125 2023-06-28 09:35:06,612 INFO [train.py:996] (2/4) Epoch 12, batch 2300, loss[loss=0.193, simple_loss=0.2713, pruned_loss=0.05736, over 21650.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2841, pruned_loss=0.06266, over 4267525.52 frames. ], batch size: 263, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 09:35:14,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2026446.0, ans=0.125 2023-06-28 09:35:14,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2026446.0, ans=0.0 2023-06-28 09:35:15,673 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 09:35:18,032 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.85 vs. limit=15.0 2023-06-28 09:35:20,310 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.470e+02 7.194e+02 1.165e+03 1.936e+03 3.464e+03, threshold=2.331e+03, percent-clipped=21.0 2023-06-28 09:35:39,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2026506.0, ans=0.125 2023-06-28 09:35:50,783 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.01 vs. limit=10.0 2023-06-28 09:36:10,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2026626.0, ans=0.0 2023-06-28 09:36:10,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2026626.0, ans=0.0 2023-06-28 09:36:53,097 INFO [train.py:996] (2/4) Epoch 12, batch 2350, loss[loss=0.205, simple_loss=0.2791, pruned_loss=0.06546, over 21611.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2821, pruned_loss=0.06348, over 4268932.67 frames. 
], batch size: 263, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 09:36:54,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2026746.0, ans=0.125 2023-06-28 09:37:50,444 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=12.0 2023-06-28 09:38:03,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2026926.0, ans=0.0 2023-06-28 09:38:25,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2026986.0, ans=0.125 2023-06-28 09:38:38,345 INFO [train.py:996] (2/4) Epoch 12, batch 2400, loss[loss=0.2267, simple_loss=0.3031, pruned_loss=0.07514, over 21863.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2873, pruned_loss=0.06578, over 4262749.35 frames. ], batch size: 118, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:38:57,280 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.929e+02 7.615e+02 1.092e+03 1.757e+03 3.744e+03, threshold=2.185e+03, percent-clipped=12.0 2023-06-28 09:39:03,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2027106.0, ans=0.0 2023-06-28 09:39:27,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2027166.0, ans=0.0 2023-06-28 09:40:24,042 INFO [train.py:996] (2/4) Epoch 12, batch 2450, loss[loss=0.1845, simple_loss=0.2603, pruned_loss=0.05429, over 21619.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2928, pruned_loss=0.06866, over 4272068.44 frames. ], batch size: 332, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:40:38,861 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.41 vs. limit=15.0 2023-06-28 09:40:58,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2027406.0, ans=0.125 2023-06-28 09:41:25,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2027466.0, ans=0.125 2023-06-28 09:41:29,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2027526.0, ans=0.125 2023-06-28 09:41:49,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2027526.0, ans=0.125 2023-06-28 09:42:08,459 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.16 vs. limit=6.0 2023-06-28 09:42:08,966 INFO [train.py:996] (2/4) Epoch 12, batch 2500, loss[loss=0.1822, simple_loss=0.2557, pruned_loss=0.05432, over 21727.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2894, pruned_loss=0.06737, over 4277030.67 frames. 
], batch size: 124, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:42:09,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2027646.0, ans=0.125 2023-06-28 09:42:27,038 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.946e+02 7.873e+02 1.330e+03 1.943e+03 4.895e+03, threshold=2.659e+03, percent-clipped=18.0 2023-06-28 09:43:15,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2027826.0, ans=0.2 2023-06-28 09:43:43,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2027886.0, ans=0.125 2023-06-28 09:43:53,468 INFO [train.py:996] (2/4) Epoch 12, batch 2550, loss[loss=0.1955, simple_loss=0.3191, pruned_loss=0.03594, over 20796.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2874, pruned_loss=0.06589, over 4270473.29 frames. ], batch size: 607, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:44:23,068 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.64 vs. limit=12.0 2023-06-28 09:44:34,616 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.90 vs. limit=15.0 2023-06-28 09:45:37,011 INFO [train.py:996] (2/4) Epoch 12, batch 2600, loss[loss=0.1857, simple_loss=0.2606, pruned_loss=0.05543, over 21578.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2868, pruned_loss=0.06578, over 4248005.00 frames. ], batch size: 263, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:45:39,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2028246.0, ans=0.125 2023-06-28 09:45:55,677 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.364e+02 1.004e+03 1.411e+03 2.308e+03 3.873e+03, threshold=2.822e+03, percent-clipped=11.0 2023-06-28 09:46:14,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2028366.0, ans=0.125 2023-06-28 09:46:36,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2028426.0, ans=0.1 2023-06-28 09:46:58,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2028426.0, ans=0.015 2023-06-28 09:47:03,144 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.96 vs. limit=15.0 2023-06-28 09:47:09,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2028486.0, ans=0.125 2023-06-28 09:47:21,872 INFO [train.py:996] (2/4) Epoch 12, batch 2650, loss[loss=0.2326, simple_loss=0.3195, pruned_loss=0.07286, over 21771.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2898, pruned_loss=0.0678, over 4260556.20 frames. 
], batch size: 414, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:48:29,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2028726.0, ans=0.125 2023-06-28 09:48:43,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2028726.0, ans=0.2 2023-06-28 09:48:45,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2028726.0, ans=0.125 2023-06-28 09:48:51,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2028786.0, ans=0.0 2023-06-28 09:48:56,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2028786.0, ans=0.125 2023-06-28 09:49:03,628 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.10 vs. limit=15.0 2023-06-28 09:49:07,744 INFO [train.py:996] (2/4) Epoch 12, batch 2700, loss[loss=0.2086, simple_loss=0.2775, pruned_loss=0.06981, over 21830.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2899, pruned_loss=0.06751, over 4265088.94 frames. ], batch size: 298, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:49:25,942 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.017e+02 6.917e+02 8.915e+02 1.240e+03 3.062e+03, threshold=1.783e+03, percent-clipped=1.0 2023-06-28 09:50:51,156 INFO [train.py:996] (2/4) Epoch 12, batch 2750, loss[loss=0.1961, simple_loss=0.2887, pruned_loss=0.05177, over 21787.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2879, pruned_loss=0.06676, over 4269532.66 frames. ], batch size: 282, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:50:51,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2029146.0, ans=0.125 2023-06-28 09:50:55,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2029146.0, ans=0.0 2023-06-28 09:52:02,557 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 09:52:43,517 INFO [train.py:996] (2/4) Epoch 12, batch 2800, loss[loss=0.2185, simple_loss=0.3041, pruned_loss=0.06646, over 21467.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2924, pruned_loss=0.06781, over 4274340.65 frames. ], batch size: 211, lr: 2.46e-03, grad_scale: 32.0 2023-06-28 09:52:56,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2029446.0, ans=0.125 2023-06-28 09:52:58,757 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.305e+02 8.151e+02 1.437e+03 2.226e+03 4.806e+03, threshold=2.874e+03, percent-clipped=38.0 2023-06-28 09:53:00,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2029506.0, ans=0.125 2023-06-28 09:53:11,708 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.75 vs. 
limit=15.0 2023-06-28 09:53:28,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2029566.0, ans=0.0 2023-06-28 09:53:43,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2029566.0, ans=0.07 2023-06-28 09:54:28,776 INFO [train.py:996] (2/4) Epoch 12, batch 2850, loss[loss=0.1525, simple_loss=0.2164, pruned_loss=0.04435, over 21277.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2924, pruned_loss=0.06937, over 4274025.35 frames. ], batch size: 176, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:54:57,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2029806.0, ans=0.5 2023-06-28 09:55:05,778 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 09:55:07,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2029806.0, ans=0.2 2023-06-28 09:55:31,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2029866.0, ans=0.125 2023-06-28 09:55:59,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2029986.0, ans=0.125 2023-06-28 09:56:12,488 INFO [train.py:996] (2/4) Epoch 12, batch 2900, loss[loss=0.1842, simple_loss=0.2569, pruned_loss=0.05572, over 21830.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2886, pruned_loss=0.06845, over 4283395.62 frames. ], batch size: 282, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:56:27,897 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.497e+02 8.368e+02 1.188e+03 2.037e+03 3.726e+03, threshold=2.377e+03, percent-clipped=4.0 2023-06-28 09:56:48,900 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.55 vs. limit=15.0 2023-06-28 09:57:25,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2030226.0, ans=0.1 2023-06-28 09:57:54,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2030286.0, ans=0.0 2023-06-28 09:57:55,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2030346.0, ans=0.125 2023-06-28 09:57:56,775 INFO [train.py:996] (2/4) Epoch 12, batch 2950, loss[loss=0.2825, simple_loss=0.3857, pruned_loss=0.08968, over 21268.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2906, pruned_loss=0.06891, over 4288011.36 frames. ], batch size: 548, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:58:35,295 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=12.0 2023-06-28 09:58:58,942 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2023-06-28 09:59:41,600 INFO [train.py:996] (2/4) Epoch 12, batch 3000, loss[loss=0.2475, simple_loss=0.3327, pruned_loss=0.08112, over 21751.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2968, pruned_loss=0.06993, over 4291702.85 frames. 
], batch size: 124, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:59:41,601 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-28 09:59:52,006 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([5.0541, 5.1964, 2.4451, 4.6917], device='cuda:2') 2023-06-28 10:00:03,550 INFO [train.py:1028] (2/4) Epoch 12, validation: loss=0.2539, simple_loss=0.3416, pruned_loss=0.08306, over 1796401.00 frames. 2023-06-28 10:00:03,550 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23793MB 2023-06-28 10:00:24,301 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.619e+02 8.195e+02 1.192e+03 1.732e+03 4.635e+03, threshold=2.384e+03, percent-clipped=12.0 2023-06-28 10:00:40,316 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.65 vs. limit=6.0 2023-06-28 10:01:13,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2030826.0, ans=0.07 2023-06-28 10:01:14,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2030826.0, ans=0.0 2023-06-28 10:01:42,908 INFO [train.py:996] (2/4) Epoch 12, batch 3050, loss[loss=0.1866, simple_loss=0.2763, pruned_loss=0.04841, over 21782.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2962, pruned_loss=0.0682, over 4296265.32 frames. ], batch size: 332, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:02:06,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2030946.0, ans=0.0 2023-06-28 10:02:20,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2031006.0, ans=0.125 2023-06-28 10:02:32,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2031066.0, ans=0.125 2023-06-28 10:02:48,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2031066.0, ans=0.0 2023-06-28 10:02:57,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2031126.0, ans=0.0 2023-06-28 10:03:18,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2031186.0, ans=0.0 2023-06-28 10:03:37,802 INFO [train.py:996] (2/4) Epoch 12, batch 3100, loss[loss=0.2187, simple_loss=0.3129, pruned_loss=0.06224, over 21640.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2952, pruned_loss=0.06707, over 4297566.73 frames. ], batch size: 263, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:03:57,010 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.760e+02 7.796e+02 1.121e+03 1.860e+03 4.097e+03, threshold=2.242e+03, percent-clipped=9.0 2023-06-28 10:04:32,203 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 10:05:25,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2031486.0, ans=0.1 2023-06-28 10:05:27,750 INFO [train.py:996] (2/4) Epoch 12, batch 3150, loss[loss=0.3079, simple_loss=0.3622, pruned_loss=0.1268, over 21457.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2956, pruned_loss=0.0673, over 4294582.18 frames. 
], batch size: 510, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:05:55,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2031606.0, ans=0.125 2023-06-28 10:05:58,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2031606.0, ans=0.125 2023-06-28 10:06:41,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2031726.0, ans=0.0 2023-06-28 10:06:51,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2031786.0, ans=0.125 2023-06-28 10:07:12,537 INFO [train.py:996] (2/4) Epoch 12, batch 3200, loss[loss=0.2446, simple_loss=0.3285, pruned_loss=0.0803, over 21724.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2971, pruned_loss=0.06785, over 4292992.23 frames. ], batch size: 441, lr: 2.46e-03, grad_scale: 32.0 2023-06-28 10:07:32,468 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.058e+02 7.728e+02 1.156e+03 1.759e+03 4.154e+03, threshold=2.311e+03, percent-clipped=17.0 2023-06-28 10:07:38,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2031906.0, ans=0.125 2023-06-28 10:07:50,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2031966.0, ans=0.0 2023-06-28 10:09:00,240 INFO [train.py:996] (2/4) Epoch 12, batch 3250, loss[loss=0.2233, simple_loss=0.2902, pruned_loss=0.07824, over 21222.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2985, pruned_loss=0.06909, over 4293905.34 frames. ], batch size: 176, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:09:56,464 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=15.0 2023-06-28 10:10:12,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2032326.0, ans=0.07 2023-06-28 10:10:39,244 INFO [train.py:996] (2/4) Epoch 12, batch 3300, loss[loss=0.2724, simple_loss=0.3609, pruned_loss=0.092, over 21452.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2935, pruned_loss=0.06859, over 4290017.17 frames. ], batch size: 507, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:10:56,000 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.931e+02 8.073e+02 1.537e+03 2.186e+03 4.176e+03, threshold=3.073e+03, percent-clipped=21.0 2023-06-28 10:11:47,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2032626.0, ans=0.0 2023-06-28 10:12:16,501 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=22.5 2023-06-28 10:12:22,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2032746.0, ans=0.0 2023-06-28 10:12:22,953 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-28 10:12:23,347 INFO [train.py:996] (2/4) Epoch 12, batch 3350, loss[loss=0.2358, simple_loss=0.3099, pruned_loss=0.08088, over 21330.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2944, pruned_loss=0.06884, over 4283792.99 frames. 
], batch size: 176, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:13:01,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=2032866.0, ans=0.2 2023-06-28 10:13:27,974 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=22.5 2023-06-28 10:13:30,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2032926.0, ans=0.125 2023-06-28 10:13:42,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2032926.0, ans=0.125 2023-06-28 10:14:06,587 INFO [train.py:996] (2/4) Epoch 12, batch 3400, loss[loss=0.2228, simple_loss=0.3143, pruned_loss=0.06562, over 21884.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2958, pruned_loss=0.07001, over 4282837.73 frames. ], batch size: 371, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:14:28,084 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.882e+02 7.652e+02 1.057e+03 1.709e+03 3.627e+03, threshold=2.113e+03, percent-clipped=2.0 2023-06-28 10:14:32,798 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-06-28 10:15:03,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2033166.0, ans=0.125 2023-06-28 10:15:06,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2033166.0, ans=0.125 2023-06-28 10:15:39,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2033286.0, ans=0.2 2023-06-28 10:15:50,723 INFO [train.py:996] (2/4) Epoch 12, batch 3450, loss[loss=0.209, simple_loss=0.2858, pruned_loss=0.06609, over 21737.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2917, pruned_loss=0.06956, over 4281625.90 frames. ], batch size: 316, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:15:53,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2033346.0, ans=0.1 2023-06-28 10:16:55,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2033526.0, ans=0.125 2023-06-28 10:17:24,252 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=15.0 2023-06-28 10:17:27,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2033586.0, ans=0.2 2023-06-28 10:17:35,124 INFO [train.py:996] (2/4) Epoch 12, batch 3500, loss[loss=0.2518, simple_loss=0.3357, pruned_loss=0.08396, over 21565.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2998, pruned_loss=0.07263, over 4276510.38 frames. ], batch size: 230, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:18:03,137 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.129e+02 8.730e+02 1.318e+03 1.854e+03 3.895e+03, threshold=2.636e+03, percent-clipped=20.0 2023-06-28 10:18:41,741 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.03 vs. 
limit=15.0 2023-06-28 10:19:23,748 INFO [train.py:996] (2/4) Epoch 12, batch 3550, loss[loss=0.1973, simple_loss=0.2724, pruned_loss=0.06109, over 21757.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3026, pruned_loss=0.07339, over 4272944.20 frames. ], batch size: 351, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:19:58,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2034006.0, ans=0.125 2023-06-28 10:20:11,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2034066.0, ans=0.125 2023-06-28 10:20:58,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2034186.0, ans=0.0 2023-06-28 10:21:06,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2034246.0, ans=0.0 2023-06-28 10:21:12,850 INFO [train.py:996] (2/4) Epoch 12, batch 3600, loss[loss=0.3355, simple_loss=0.4265, pruned_loss=0.1223, over 21608.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2978, pruned_loss=0.07247, over 4274719.26 frames. ], batch size: 441, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:21:23,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2034246.0, ans=0.1 2023-06-28 10:21:23,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2034246.0, ans=0.1 2023-06-28 10:21:24,230 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-28 10:21:31,745 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.278e+02 7.988e+02 1.219e+03 1.896e+03 5.241e+03, threshold=2.438e+03, percent-clipped=11.0 2023-06-28 10:22:29,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2034486.0, ans=0.2 2023-06-28 10:22:31,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2034486.0, ans=0.0 2023-06-28 10:22:51,695 INFO [train.py:996] (2/4) Epoch 12, batch 3650, loss[loss=0.2198, simple_loss=0.3292, pruned_loss=0.05521, over 20951.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2972, pruned_loss=0.07252, over 4280533.88 frames. ], batch size: 608, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:23:32,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2034606.0, ans=0.0 2023-06-28 10:23:36,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2034666.0, ans=0.2 2023-06-28 10:24:08,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2034786.0, ans=10.0 2023-06-28 10:24:33,951 INFO [train.py:996] (2/4) Epoch 12, batch 3700, loss[loss=0.1993, simple_loss=0.2759, pruned_loss=0.06131, over 21840.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2949, pruned_loss=0.07117, over 4282059.50 frames. 
], batch size: 247, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:24:42,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2034846.0, ans=0.2 2023-06-28 10:24:42,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2034846.0, ans=0.125 2023-06-28 10:24:48,189 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.91 vs. limit=6.0 2023-06-28 10:24:57,049 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.964e+02 7.431e+02 1.073e+03 1.535e+03 4.329e+03, threshold=2.147e+03, percent-clipped=8.0 2023-06-28 10:26:17,532 INFO [train.py:996] (2/4) Epoch 12, batch 3750, loss[loss=0.2076, simple_loss=0.2862, pruned_loss=0.06452, over 21823.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2935, pruned_loss=0.06994, over 4284624.59 frames. ], batch size: 112, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:26:24,007 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.97 vs. limit=22.5 2023-06-28 10:26:36,046 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=15.0 2023-06-28 10:26:48,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2035206.0, ans=0.0 2023-06-28 10:27:11,978 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.92 vs. limit=15.0 2023-06-28 10:27:57,852 INFO [train.py:996] (2/4) Epoch 12, batch 3800, loss[loss=0.2434, simple_loss=0.3305, pruned_loss=0.07816, over 21801.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2921, pruned_loss=0.06901, over 4284133.13 frames. ], batch size: 124, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:28:19,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2035506.0, ans=0.0 2023-06-28 10:28:21,913 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.866e+02 7.287e+02 1.012e+03 1.468e+03 2.920e+03, threshold=2.024e+03, percent-clipped=9.0 2023-06-28 10:28:32,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2035506.0, ans=0.2 2023-06-28 10:28:44,410 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.97 vs. limit=6.0 2023-06-28 10:29:10,479 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.65 vs. limit=15.0 2023-06-28 10:29:40,116 INFO [train.py:996] (2/4) Epoch 12, batch 3850, loss[loss=0.2169, simple_loss=0.2909, pruned_loss=0.07148, over 21391.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2899, pruned_loss=0.06883, over 4286760.57 frames. ], batch size: 211, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:31:23,413 INFO [train.py:996] (2/4) Epoch 12, batch 3900, loss[loss=0.2047, simple_loss=0.2657, pruned_loss=0.07184, over 21490.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2888, pruned_loss=0.06953, over 4289334.03 frames. 
], batch size: 441, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:31:47,274 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.137e+02 7.075e+02 9.098e+02 1.343e+03 3.131e+03, threshold=1.820e+03, percent-clipped=11.0 2023-06-28 10:32:22,795 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 10:33:08,682 INFO [train.py:996] (2/4) Epoch 12, batch 3950, loss[loss=0.1553, simple_loss=0.2409, pruned_loss=0.03483, over 21342.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2893, pruned_loss=0.06845, over 4284084.67 frames. ], batch size: 194, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:33:33,220 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=15.0 2023-06-28 10:33:39,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2036406.0, ans=0.1 2023-06-28 10:33:45,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2036466.0, ans=0.0 2023-06-28 10:34:25,237 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.35 vs. limit=8.0 2023-06-28 10:34:44,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2036586.0, ans=0.2 2023-06-28 10:34:52,697 INFO [train.py:996] (2/4) Epoch 12, batch 4000, loss[loss=0.1969, simple_loss=0.2672, pruned_loss=0.06325, over 15124.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2826, pruned_loss=0.06551, over 4272179.04 frames. ], batch size: 62, lr: 2.46e-03, grad_scale: 32.0 2023-06-28 10:35:13,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2036706.0, ans=0.09899494936611666 2023-06-28 10:35:16,139 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.215e+02 7.767e+02 1.100e+03 1.663e+03 3.671e+03, threshold=2.200e+03, percent-clipped=20.0 2023-06-28 10:35:20,679 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-28 10:35:29,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2036706.0, ans=0.1 2023-06-28 10:35:32,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2036766.0, ans=0.125 2023-06-28 10:36:35,223 INFO [train.py:996] (2/4) Epoch 12, batch 4050, loss[loss=0.2015, simple_loss=0.2814, pruned_loss=0.06078, over 21852.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2814, pruned_loss=0.06355, over 4274027.29 frames. ], batch size: 371, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:36:46,355 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.02 vs. 
limit=10.0 2023-06-28 10:36:47,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2036946.0, ans=0.2 2023-06-28 10:37:13,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2037066.0, ans=0.125 2023-06-28 10:37:19,218 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-28 10:38:18,408 INFO [train.py:996] (2/4) Epoch 12, batch 4100, loss[loss=0.1962, simple_loss=0.2783, pruned_loss=0.05708, over 21266.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2841, pruned_loss=0.06444, over 4275933.31 frames. ], batch size: 159, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:38:39,718 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.03 vs. limit=22.5 2023-06-28 10:38:45,581 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.840e+02 7.700e+02 1.227e+03 1.924e+03 4.359e+03, threshold=2.455e+03, percent-clipped=14.0 2023-06-28 10:40:05,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2037546.0, ans=0.0 2023-06-28 10:40:06,882 INFO [train.py:996] (2/4) Epoch 12, batch 4150, loss[loss=0.1896, simple_loss=0.2703, pruned_loss=0.05451, over 21053.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.284, pruned_loss=0.06288, over 4277253.22 frames. ], batch size: 143, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:40:12,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2037546.0, ans=0.125 2023-06-28 10:41:26,972 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.36 vs. limit=15.0 2023-06-28 10:41:46,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2037786.0, ans=0.1 2023-06-28 10:41:51,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2037846.0, ans=0.125 2023-06-28 10:41:52,383 INFO [train.py:996] (2/4) Epoch 12, batch 4200, loss[loss=0.1787, simple_loss=0.2533, pruned_loss=0.05204, over 21482.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2846, pruned_loss=0.06204, over 4275868.57 frames. ], batch size: 230, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:42:14,626 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.777e+02 8.293e+02 1.484e+03 2.185e+03 3.637e+03, threshold=2.967e+03, percent-clipped=18.0 2023-06-28 10:42:41,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2037966.0, ans=0.04949747468305833 2023-06-28 10:43:35,031 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-06-28 10:43:37,190 INFO [train.py:996] (2/4) Epoch 12, batch 4250, loss[loss=0.2355, simple_loss=0.322, pruned_loss=0.07453, over 21760.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2925, pruned_loss=0.06494, over 4273421.08 frames. 
], batch size: 351, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:43:39,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2038146.0, ans=0.125 2023-06-28 10:43:39,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2038146.0, ans=0.125 2023-06-28 10:43:44,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2038146.0, ans=0.125 2023-06-28 10:44:09,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2038206.0, ans=0.07 2023-06-28 10:44:19,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2038266.0, ans=0.125 2023-06-28 10:44:38,260 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-28 10:44:43,204 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.33 vs. limit=22.5 2023-06-28 10:45:00,803 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=15.0 2023-06-28 10:45:05,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2038386.0, ans=0.05 2023-06-28 10:45:16,322 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.63 vs. limit=15.0 2023-06-28 10:45:24,221 INFO [train.py:996] (2/4) Epoch 12, batch 4300, loss[loss=0.2214, simple_loss=0.3298, pruned_loss=0.05646, over 21627.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2984, pruned_loss=0.06668, over 4272200.58 frames. ], batch size: 389, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:45:31,798 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-28 10:45:33,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2038446.0, ans=0.1 2023-06-28 10:46:00,802 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.611e+02 9.355e+02 1.305e+03 1.983e+03 5.098e+03, threshold=2.609e+03, percent-clipped=8.0 2023-06-28 10:46:01,368 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 10:46:29,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=2038566.0, ans=0.2 2023-06-28 10:46:59,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2038686.0, ans=0.1 2023-06-28 10:47:12,496 INFO [train.py:996] (2/4) Epoch 12, batch 4350, loss[loss=0.2258, simple_loss=0.2999, pruned_loss=0.07589, over 21506.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2984, pruned_loss=0.06626, over 4272626.84 frames. 
], batch size: 441, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:47:24,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2038746.0, ans=0.0 2023-06-28 10:48:01,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2038866.0, ans=0.04949747468305833 2023-06-28 10:48:01,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2038866.0, ans=0.125 2023-06-28 10:48:22,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2038926.0, ans=0.125 2023-06-28 10:48:24,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2038926.0, ans=0.1 2023-06-28 10:48:50,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2038986.0, ans=0.0 2023-06-28 10:49:03,199 INFO [train.py:996] (2/4) Epoch 12, batch 4400, loss[loss=0.2365, simple_loss=0.3286, pruned_loss=0.07225, over 21574.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2945, pruned_loss=0.06535, over 4260094.11 frames. ], batch size: 441, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:49:35,015 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.000e+02 1.052e+03 1.456e+03 1.843e+03 4.869e+03, threshold=2.912e+03, percent-clipped=14.0 2023-06-28 10:49:41,858 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.90 vs. limit=22.5 2023-06-28 10:49:49,988 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=3.309e-03 2023-06-28 10:50:03,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2039226.0, ans=0.0 2023-06-28 10:50:03,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2039226.0, ans=0.1 2023-06-28 10:50:53,927 INFO [train.py:996] (2/4) Epoch 12, batch 4450, loss[loss=0.2286, simple_loss=0.3185, pruned_loss=0.06935, over 21720.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.3027, pruned_loss=0.06729, over 4267098.58 frames. ], batch size: 263, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:50:56,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2039346.0, ans=0.0 2023-06-28 10:51:04,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2039346.0, ans=0.125 2023-06-28 10:51:12,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2039406.0, ans=0.1 2023-06-28 10:51:31,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2039466.0, ans=0.125 2023-06-28 10:52:38,112 INFO [train.py:996] (2/4) Epoch 12, batch 4500, loss[loss=0.2178, simple_loss=0.2966, pruned_loss=0.06948, over 21908.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.3023, pruned_loss=0.0683, over 4268165.54 frames. 
], batch size: 107, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:52:38,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2039646.0, ans=0.2 2023-06-28 10:53:04,865 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.459e+02 9.304e+02 1.246e+03 2.301e+03 3.917e+03, threshold=2.492e+03, percent-clipped=11.0 2023-06-28 10:53:06,076 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.40 vs. limit=15.0 2023-06-28 10:53:12,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2039706.0, ans=0.0 2023-06-28 10:53:28,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2039766.0, ans=0.0 2023-06-28 10:54:28,127 INFO [train.py:996] (2/4) Epoch 12, batch 4550, loss[loss=0.2397, simple_loss=0.3261, pruned_loss=0.07665, over 21449.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.3049, pruned_loss=0.06842, over 4275581.01 frames. ], batch size: 131, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:56:11,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2040186.0, ans=0.125 2023-06-28 10:56:11,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2040186.0, ans=0.0 2023-06-28 10:56:14,154 INFO [train.py:996] (2/4) Epoch 12, batch 4600, loss[loss=0.2468, simple_loss=0.3237, pruned_loss=0.085, over 21312.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3048, pruned_loss=0.06915, over 4280785.36 frames. ], batch size: 143, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:56:36,689 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.360e+02 7.467e+02 1.139e+03 1.677e+03 2.825e+03, threshold=2.277e+03, percent-clipped=5.0 2023-06-28 10:57:10,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2040366.0, ans=0.125 2023-06-28 10:57:58,184 INFO [train.py:996] (2/4) Epoch 12, batch 4650, loss[loss=0.2025, simple_loss=0.317, pruned_loss=0.04395, over 20914.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.3006, pruned_loss=0.06821, over 4285921.50 frames. ], batch size: 608, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:58:03,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2040546.0, ans=0.125 2023-06-28 10:58:26,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2040606.0, ans=0.1 2023-06-28 10:59:03,665 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.38 vs. limit=6.0 2023-06-28 10:59:16,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2040726.0, ans=0.5 2023-06-28 10:59:40,594 INFO [train.py:996] (2/4) Epoch 12, batch 4700, loss[loss=0.1856, simple_loss=0.2525, pruned_loss=0.05933, over 21607.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2932, pruned_loss=0.06647, over 4286235.07 frames. 
], batch size: 247, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 11:00:07,737 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.876e+02 7.682e+02 1.181e+03 1.934e+03 4.585e+03, threshold=2.362e+03, percent-clipped=15.0 2023-06-28 11:00:08,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2040906.0, ans=0.1 2023-06-28 11:01:13,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2041086.0, ans=0.125 2023-06-28 11:01:23,168 INFO [train.py:996] (2/4) Epoch 12, batch 4750, loss[loss=0.2372, simple_loss=0.3723, pruned_loss=0.05103, over 20753.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2873, pruned_loss=0.0656, over 4287951.46 frames. ], batch size: 607, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 11:01:57,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2041206.0, ans=0.05 2023-06-28 11:03:05,688 INFO [train.py:996] (2/4) Epoch 12, batch 4800, loss[loss=0.1992, simple_loss=0.2789, pruned_loss=0.05971, over 21319.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2862, pruned_loss=0.06589, over 4292621.97 frames. ], batch size: 159, lr: 2.46e-03, grad_scale: 32.0 2023-06-28 11:03:31,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2041506.0, ans=0.1 2023-06-28 11:03:32,403 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.216e+02 8.084e+02 1.278e+03 1.855e+03 4.015e+03, threshold=2.556e+03, percent-clipped=12.0 2023-06-28 11:04:17,531 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=12.0 2023-06-28 11:04:20,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2041626.0, ans=0.125 2023-06-28 11:04:47,270 INFO [train.py:996] (2/4) Epoch 12, batch 4850, loss[loss=0.2088, simple_loss=0.2859, pruned_loss=0.0659, over 21835.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2878, pruned_loss=0.06549, over 4297931.53 frames. ], batch size: 332, lr: 2.46e-03, grad_scale: 32.0 2023-06-28 11:04:58,491 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=22.5 2023-06-28 11:05:21,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2041806.0, ans=0.0 2023-06-28 11:05:40,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2041866.0, ans=0.04949747468305833 2023-06-28 11:06:30,301 INFO [train.py:996] (2/4) Epoch 12, batch 4900, loss[loss=0.2201, simple_loss=0.2926, pruned_loss=0.07374, over 16694.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2894, pruned_loss=0.06653, over 4286966.10 frames. 
], batch size: 63, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 11:06:37,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2042046.0, ans=0.0 2023-06-28 11:06:58,462 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.185e+02 7.394e+02 1.193e+03 1.925e+03 4.019e+03, threshold=2.386e+03, percent-clipped=10.0 2023-06-28 11:06:59,659 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.41 vs. limit=15.0 2023-06-28 11:07:11,351 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.95 vs. limit=22.5 2023-06-28 11:07:43,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2042226.0, ans=0.125 2023-06-28 11:07:44,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2042226.0, ans=0.0 2023-06-28 11:07:54,768 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.55 vs. limit=12.0 2023-06-28 11:08:14,027 INFO [train.py:996] (2/4) Epoch 12, batch 4950, loss[loss=0.1909, simple_loss=0.2686, pruned_loss=0.05655, over 20823.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2926, pruned_loss=0.06507, over 4279183.63 frames. ], batch size: 607, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:08:30,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2042346.0, ans=0.0 2023-06-28 11:09:01,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2042466.0, ans=0.2 2023-06-28 11:09:54,802 INFO [train.py:996] (2/4) Epoch 12, batch 5000, loss[loss=0.2005, simple_loss=0.2817, pruned_loss=0.0596, over 21863.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2912, pruned_loss=0.06224, over 4281032.82 frames. ], batch size: 351, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:10:22,975 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.667e+02 7.098e+02 1.009e+03 1.573e+03 3.184e+03, threshold=2.017e+03, percent-clipped=11.0 2023-06-28 11:10:29,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2042706.0, ans=0.0 2023-06-28 11:10:32,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2042706.0, ans=0.0 2023-06-28 11:10:47,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2042766.0, ans=0.2 2023-06-28 11:11:29,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2042886.0, ans=0.04949747468305833 2023-06-28 11:11:35,595 INFO [train.py:996] (2/4) Epoch 12, batch 5050, loss[loss=0.2215, simple_loss=0.2961, pruned_loss=0.07343, over 21351.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2907, pruned_loss=0.06388, over 4288427.15 frames. ], batch size: 176, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:11:36,709 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.51 vs. 
limit=12.0 2023-06-28 11:11:41,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2042946.0, ans=0.0 2023-06-28 11:12:00,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2043006.0, ans=0.2 2023-06-28 11:12:46,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2043126.0, ans=0.0 2023-06-28 11:12:51,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2043126.0, ans=0.125 2023-06-28 11:13:17,777 INFO [train.py:996] (2/4) Epoch 12, batch 5100, loss[loss=0.2213, simple_loss=0.2911, pruned_loss=0.07577, over 21370.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2896, pruned_loss=0.06501, over 4296945.47 frames. ], batch size: 143, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:13:26,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2043246.0, ans=0.07 2023-06-28 11:13:28,074 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 11:13:45,413 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.047e+02 7.915e+02 1.019e+03 1.431e+03 3.420e+03, threshold=2.039e+03, percent-clipped=6.0 2023-06-28 11:14:12,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2043366.0, ans=0.5 2023-06-28 11:14:14,330 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 11:14:33,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2043426.0, ans=0.125 2023-06-28 11:14:52,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2043486.0, ans=0.1 2023-06-28 11:15:00,440 INFO [train.py:996] (2/4) Epoch 12, batch 5150, loss[loss=0.2007, simple_loss=0.2729, pruned_loss=0.06427, over 21639.00 frames. ], tot_loss[loss=0.209, simple_loss=0.287, pruned_loss=0.06549, over 4299377.97 frames. ], batch size: 230, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:16:07,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2043726.0, ans=0.2 2023-06-28 11:16:34,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2043786.0, ans=0.0 2023-06-28 11:16:44,556 INFO [train.py:996] (2/4) Epoch 12, batch 5200, loss[loss=0.2572, simple_loss=0.3536, pruned_loss=0.08037, over 21705.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2895, pruned_loss=0.06665, over 4291415.84 frames. 
], batch size: 414, lr: 2.45e-03, grad_scale: 32.0 2023-06-28 11:17:18,856 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.436e+02 7.444e+02 1.331e+03 2.729e+03 6.291e+03, threshold=2.663e+03, percent-clipped=30.0 2023-06-28 11:17:32,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2043966.0, ans=0.125 2023-06-28 11:18:12,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2044086.0, ans=0.2 2023-06-28 11:18:26,592 INFO [train.py:996] (2/4) Epoch 12, batch 5250, loss[loss=0.2047, simple_loss=0.2815, pruned_loss=0.06399, over 21874.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2922, pruned_loss=0.06504, over 4284129.03 frames. ], batch size: 107, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:19:31,334 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=15.0 2023-06-28 11:20:08,082 INFO [train.py:996] (2/4) Epoch 12, batch 5300, loss[loss=0.2079, simple_loss=0.2831, pruned_loss=0.06632, over 21933.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2925, pruned_loss=0.06575, over 4289364.71 frames. ], batch size: 333, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:20:19,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2044446.0, ans=0.125 2023-06-28 11:20:42,498 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.313e+02 7.509e+02 1.039e+03 1.571e+03 3.451e+03, threshold=2.078e+03, percent-clipped=7.0 2023-06-28 11:20:51,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2044566.0, ans=0.0 2023-06-28 11:21:07,476 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 11:21:25,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2044626.0, ans=0.2 2023-06-28 11:21:26,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2044626.0, ans=0.2 2023-06-28 11:21:28,828 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.35 vs. limit=15.0 2023-06-28 11:21:39,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2044686.0, ans=0.125 2023-06-28 11:21:48,594 INFO [train.py:996] (2/4) Epoch 12, batch 5350, loss[loss=0.221, simple_loss=0.288, pruned_loss=0.07696, over 21941.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2905, pruned_loss=0.0662, over 4294091.88 frames. 
], batch size: 333, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:22:01,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2044746.0, ans=0.0 2023-06-28 11:22:03,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2044746.0, ans=0.1 2023-06-28 11:22:49,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2044866.0, ans=0.09899494936611666 2023-06-28 11:23:23,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2044986.0, ans=0.1 2023-06-28 11:23:32,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2044986.0, ans=0.125 2023-06-28 11:23:35,448 INFO [train.py:996] (2/4) Epoch 12, batch 5400, loss[loss=0.1948, simple_loss=0.2818, pruned_loss=0.05393, over 21695.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2892, pruned_loss=0.06683, over 4293755.00 frames. ], batch size: 441, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:24:05,989 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.403e+02 8.319e+02 1.196e+03 1.782e+03 3.222e+03, threshold=2.392e+03, percent-clipped=18.0 2023-06-28 11:24:10,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2045106.0, ans=0.0 2023-06-28 11:24:25,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2045166.0, ans=0.0 2023-06-28 11:24:56,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2045286.0, ans=0.0 2023-06-28 11:25:15,747 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.59 vs. limit=15.0 2023-06-28 11:25:19,471 INFO [train.py:996] (2/4) Epoch 12, batch 5450, loss[loss=0.2347, simple_loss=0.332, pruned_loss=0.06869, over 21276.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2894, pruned_loss=0.06543, over 4297849.65 frames. ], batch size: 548, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:25:26,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2045346.0, ans=0.125 2023-06-28 11:26:12,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2045466.0, ans=0.125 2023-06-28 11:26:14,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2045466.0, ans=0.05 2023-06-28 11:26:19,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2045466.0, ans=0.125 2023-06-28 11:27:08,766 INFO [train.py:996] (2/4) Epoch 12, batch 5500, loss[loss=0.2245, simple_loss=0.3316, pruned_loss=0.05875, over 21680.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2959, pruned_loss=0.06316, over 4296454.82 frames. 
], batch size: 414, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:27:13,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2045646.0, ans=0.0 2023-06-28 11:27:31,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2045706.0, ans=0.125 2023-06-28 11:27:44,012 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.944e+02 8.580e+02 1.207e+03 1.863e+03 4.637e+03, threshold=2.413e+03, percent-clipped=15.0 2023-06-28 11:28:04,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2045766.0, ans=0.125 2023-06-28 11:28:12,090 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=22.5 2023-06-28 11:28:40,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2045886.0, ans=0.125 2023-06-28 11:28:57,718 INFO [train.py:996] (2/4) Epoch 12, batch 5550, loss[loss=0.1823, simple_loss=0.2818, pruned_loss=0.04138, over 21760.00 frames. ], tot_loss[loss=0.209, simple_loss=0.296, pruned_loss=0.06095, over 4290500.75 frames. ], batch size: 371, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:29:14,169 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.21 vs. limit=15.0 2023-06-28 11:30:35,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2046186.0, ans=0.125 2023-06-28 11:30:46,203 INFO [train.py:996] (2/4) Epoch 12, batch 5600, loss[loss=0.2316, simple_loss=0.3317, pruned_loss=0.06577, over 21707.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2932, pruned_loss=0.05845, over 4284092.95 frames. ], batch size: 298, lr: 2.45e-03, grad_scale: 32.0 2023-06-28 11:31:13,175 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.841e+02 9.110e+02 1.414e+03 2.313e+03 5.859e+03, threshold=2.829e+03, percent-clipped=23.0 2023-06-28 11:31:31,750 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=12.0 2023-06-28 11:31:32,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2046366.0, ans=0.125 2023-06-28 11:32:27,091 INFO [train.py:996] (2/4) Epoch 12, batch 5650, loss[loss=0.2011, simple_loss=0.2682, pruned_loss=0.06699, over 21251.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2973, pruned_loss=0.06175, over 4285897.78 frames. 
], batch size: 608, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:32:31,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2046546.0, ans=0.2 2023-06-28 11:32:32,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2046546.0, ans=0.125 2023-06-28 11:33:48,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2046786.0, ans=0.0 2023-06-28 11:34:00,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2046786.0, ans=0.2 2023-06-28 11:34:09,874 INFO [train.py:996] (2/4) Epoch 12, batch 5700, loss[loss=0.2009, simple_loss=0.3043, pruned_loss=0.04876, over 21237.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2958, pruned_loss=0.06265, over 4290888.06 frames. ], batch size: 548, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:34:42,201 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.668e+02 8.634e+02 1.270e+03 1.811e+03 3.578e+03, threshold=2.540e+03, percent-clipped=6.0 2023-06-28 11:35:08,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2046966.0, ans=0.125 2023-06-28 11:35:21,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2047026.0, ans=0.125 2023-06-28 11:35:54,481 INFO [train.py:996] (2/4) Epoch 12, batch 5750, loss[loss=0.1775, simple_loss=0.2678, pruned_loss=0.04357, over 21802.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2916, pruned_loss=0.05954, over 4287680.50 frames. ], batch size: 371, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:36:27,487 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.30 vs. limit=22.5 2023-06-28 11:36:47,659 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=22.5 2023-06-28 11:37:05,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2047326.0, ans=0.0 2023-06-28 11:37:07,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2047326.0, ans=0.0 2023-06-28 11:37:30,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2047386.0, ans=0.125 2023-06-28 11:37:38,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2047386.0, ans=0.125 2023-06-28 11:37:41,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2047446.0, ans=0.035 2023-06-28 11:37:41,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2047446.0, ans=0.09899494936611666 2023-06-28 11:37:43,053 INFO [train.py:996] (2/4) Epoch 12, batch 5800, loss[loss=0.1995, simple_loss=0.3003, pruned_loss=0.04933, over 21631.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2908, pruned_loss=0.05835, over 4289679.71 frames. 
], batch size: 263, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:38:14,539 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.621e+02 6.881e+02 1.222e+03 1.758e+03 3.677e+03, threshold=2.444e+03, percent-clipped=11.0 2023-06-28 11:38:38,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2047566.0, ans=0.125 2023-06-28 11:38:52,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2047626.0, ans=0.125 2023-06-28 11:39:31,901 INFO [train.py:996] (2/4) Epoch 12, batch 5850, loss[loss=0.2367, simple_loss=0.3167, pruned_loss=0.07831, over 20157.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.291, pruned_loss=0.05606, over 4282750.26 frames. ], batch size: 702, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:39:35,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2047746.0, ans=0.125 2023-06-28 11:39:58,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2047806.0, ans=0.0 2023-06-28 11:40:17,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2047866.0, ans=0.2 2023-06-28 11:41:00,165 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.44 vs. limit=15.0 2023-06-28 11:41:02,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=2047986.0, ans=0.5 2023-06-28 11:41:08,980 INFO [train.py:996] (2/4) Epoch 12, batch 5900, loss[loss=0.1429, simple_loss=0.2453, pruned_loss=0.0203, over 21822.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.2838, pruned_loss=0.05137, over 4287818.37 frames. ], batch size: 316, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:41:22,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2048046.0, ans=0.125 2023-06-28 11:41:25,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2048046.0, ans=0.125 2023-06-28 11:41:44,138 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.278e+02 9.930e+02 1.759e+03 2.367e+03 3.954e+03, threshold=3.519e+03, percent-clipped=21.0 2023-06-28 11:41:59,949 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.94 vs. limit=22.5 2023-06-28 11:42:10,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2048226.0, ans=0.025 2023-06-28 11:42:54,194 INFO [train.py:996] (2/4) Epoch 12, batch 5950, loss[loss=0.1649, simple_loss=0.2366, pruned_loss=0.04666, over 21627.00 frames. ], tot_loss[loss=0.1963, simple_loss=0.284, pruned_loss=0.05425, over 4293479.13 frames. ], batch size: 263, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:43:27,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2048406.0, ans=0.0 2023-06-28 11:43:31,888 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.38 vs. 
limit=12.0 2023-06-28 11:43:37,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2048466.0, ans=0.0 2023-06-28 11:43:46,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2048466.0, ans=0.2 2023-06-28 11:44:36,736 INFO [train.py:996] (2/4) Epoch 12, batch 6000, loss[loss=0.1849, simple_loss=0.2493, pruned_loss=0.06024, over 21746.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.28, pruned_loss=0.05679, over 4283419.21 frames. ], batch size: 300, lr: 2.45e-03, grad_scale: 32.0 2023-06-28 11:44:36,737 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-28 11:44:57,251 INFO [train.py:1028] (2/4) Epoch 12, validation: loss=0.2597, simple_loss=0.3509, pruned_loss=0.08424, over 1796401.00 frames. 2023-06-28 11:44:57,252 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23793MB 2023-06-28 11:45:28,553 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.837e+02 9.369e+02 1.291e+03 2.028e+03 3.757e+03, threshold=2.582e+03, percent-clipped=1.0 2023-06-28 11:45:32,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2048706.0, ans=0.07 2023-06-28 11:45:47,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2048766.0, ans=0.125 2023-06-28 11:46:26,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2048886.0, ans=0.125 2023-06-28 11:46:38,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2048946.0, ans=0.125 2023-06-28 11:46:38,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2048946.0, ans=0.125 2023-06-28 11:46:40,049 INFO [train.py:996] (2/4) Epoch 12, batch 6050, loss[loss=0.1923, simple_loss=0.2602, pruned_loss=0.06218, over 22025.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2749, pruned_loss=0.05742, over 4280108.45 frames. ], batch size: 103, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:47:03,229 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=15.0 2023-06-28 11:47:04,861 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=22.5 2023-06-28 11:47:27,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2049066.0, ans=0.125 2023-06-28 11:47:38,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2049126.0, ans=0.0 2023-06-28 11:48:19,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2049186.0, ans=0.0 2023-06-28 11:48:28,629 INFO [train.py:996] (2/4) Epoch 12, batch 6100, loss[loss=0.1886, simple_loss=0.2712, pruned_loss=0.05297, over 21735.00 frames. ], tot_loss[loss=0.194, simple_loss=0.2745, pruned_loss=0.05674, over 4270388.93 frames. 
], batch size: 247, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:48:57,066 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.586e+02 8.433e+02 1.328e+03 2.179e+03 5.742e+03, threshold=2.657e+03, percent-clipped=17.0 2023-06-28 11:48:57,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2049306.0, ans=0.0 2023-06-28 11:49:05,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2049306.0, ans=0.125 2023-06-28 11:49:15,561 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-28 11:49:37,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2049426.0, ans=0.2 2023-06-28 11:49:45,014 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.82 vs. limit=10.0 2023-06-28 11:50:13,329 INFO [train.py:996] (2/4) Epoch 12, batch 6150, loss[loss=0.215, simple_loss=0.2819, pruned_loss=0.0741, over 21901.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.2761, pruned_loss=0.05875, over 4284019.92 frames. ], batch size: 98, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:51:01,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2049666.0, ans=0.0 2023-06-28 11:51:15,474 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.96 vs. limit=15.0 2023-06-28 11:51:30,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2049786.0, ans=0.125 2023-06-28 11:51:56,280 INFO [train.py:996] (2/4) Epoch 12, batch 6200, loss[loss=0.1961, simple_loss=0.2866, pruned_loss=0.0528, over 21696.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2792, pruned_loss=0.05991, over 4284174.79 frames. ], batch size: 247, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:52:16,710 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.56 vs. limit=6.0 2023-06-28 11:52:31,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2049906.0, ans=0.2 2023-06-28 11:52:32,346 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.502e+02 7.829e+02 1.153e+03 1.728e+03 4.252e+03, threshold=2.307e+03, percent-clipped=8.0 2023-06-28 11:53:10,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2050026.0, ans=0.125 2023-06-28 11:53:17,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2050026.0, ans=0.125 2023-06-28 11:53:33,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2050086.0, ans=0.1 2023-06-28 11:53:41,398 INFO [train.py:996] (2/4) Epoch 12, batch 6250, loss[loss=0.2582, simple_loss=0.3556, pruned_loss=0.08037, over 21523.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2852, pruned_loss=0.06045, over 4282488.78 frames. 
], batch size: 471, lr: 2.45e-03, grad_scale: 8.0 2023-06-28 11:53:42,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2050146.0, ans=0.1 2023-06-28 11:54:06,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2050206.0, ans=0.125 2023-06-28 11:54:16,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2050206.0, ans=0.035 2023-06-28 11:54:37,271 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.92 vs. limit=10.0 2023-06-28 11:55:09,852 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.68 vs. limit=15.0 2023-06-28 11:55:19,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2050386.0, ans=0.125 2023-06-28 11:55:23,840 INFO [train.py:996] (2/4) Epoch 12, batch 6300, loss[loss=0.2028, simple_loss=0.2683, pruned_loss=0.06859, over 20002.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2897, pruned_loss=0.05923, over 4281028.71 frames. ], batch size: 702, lr: 2.45e-03, grad_scale: 8.0 2023-06-28 11:56:03,346 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.650e+02 7.162e+02 1.070e+03 1.625e+03 2.845e+03, threshold=2.140e+03, percent-clipped=5.0 2023-06-28 11:57:05,258 INFO [train.py:996] (2/4) Epoch 12, batch 6350, loss[loss=0.2851, simple_loss=0.3462, pruned_loss=0.112, over 21504.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2917, pruned_loss=0.06257, over 4291382.74 frames. ], batch size: 471, lr: 2.45e-03, grad_scale: 8.0 2023-06-28 11:58:19,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=2050926.0, ans=10.0 2023-06-28 11:58:21,363 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-06-28 11:58:54,058 INFO [train.py:996] (2/4) Epoch 12, batch 6400, loss[loss=0.2574, simple_loss=0.343, pruned_loss=0.0859, over 21794.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2974, pruned_loss=0.06665, over 4291520.21 frames. ], batch size: 124, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:58:55,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=2051046.0, ans=15.0 2023-06-28 11:59:20,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2051106.0, ans=0.1 2023-06-28 11:59:29,779 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.784e+02 8.222e+02 1.150e+03 1.542e+03 3.199e+03, threshold=2.299e+03, percent-clipped=10.0 2023-06-28 12:00:22,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2051286.0, ans=0.125 2023-06-28 12:00:33,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2051286.0, ans=0.1 2023-06-28 12:00:36,726 INFO [train.py:996] (2/4) Epoch 12, batch 6450, loss[loss=0.2047, simple_loss=0.2931, pruned_loss=0.05819, over 21561.00 frames. 
], tot_loss[loss=0.2154, simple_loss=0.2987, pruned_loss=0.06602, over 4282228.61 frames. ], batch size: 389, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:00:38,215 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.59 vs. limit=10.0 2023-06-28 12:01:05,481 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 12:01:35,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2051466.0, ans=0.125 2023-06-28 12:02:04,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2051586.0, ans=0.125 2023-06-28 12:02:16,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2051586.0, ans=0.0 2023-06-28 12:02:20,327 INFO [train.py:996] (2/4) Epoch 12, batch 6500, loss[loss=0.2266, simple_loss=0.2794, pruned_loss=0.0869, over 21247.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2906, pruned_loss=0.06446, over 4277693.38 frames. ], batch size: 471, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:02:59,805 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.006e+02 7.341e+02 1.379e+03 1.907e+03 4.704e+03, threshold=2.757e+03, percent-clipped=17.0 2023-06-28 12:03:49,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2051886.0, ans=0.125 2023-06-28 12:03:54,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2051886.0, ans=0.125 2023-06-28 12:04:03,594 INFO [train.py:996] (2/4) Epoch 12, batch 6550, loss[loss=0.2398, simple_loss=0.3071, pruned_loss=0.08621, over 21723.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2893, pruned_loss=0.06423, over 4275567.94 frames. ], batch size: 441, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:04:15,555 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.15 vs. limit=12.0 2023-06-28 12:04:45,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2052006.0, ans=0.0 2023-06-28 12:05:44,407 INFO [train.py:996] (2/4) Epoch 12, batch 6600, loss[loss=0.1681, simple_loss=0.2332, pruned_loss=0.05146, over 21642.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2836, pruned_loss=0.06374, over 4277616.26 frames. ], batch size: 247, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:05:52,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2052246.0, ans=0.0 2023-06-28 12:06:28,657 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.752e+02 7.717e+02 1.174e+03 1.589e+03 2.955e+03, threshold=2.349e+03, percent-clipped=1.0 2023-06-28 12:07:08,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2052486.0, ans=0.0 2023-06-28 12:07:25,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2052486.0, ans=0.2 2023-06-28 12:07:32,096 INFO [train.py:996] (2/4) Epoch 12, batch 6650, loss[loss=0.1577, simple_loss=0.2531, pruned_loss=0.03113, over 21727.00 frames. 
], tot_loss[loss=0.1994, simple_loss=0.2766, pruned_loss=0.06107, over 4275605.14 frames. ], batch size: 333, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:07:32,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2052546.0, ans=0.125 2023-06-28 12:07:39,488 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 12:08:13,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2052666.0, ans=0.125 2023-06-28 12:09:07,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2052786.0, ans=0.125 2023-06-28 12:09:13,034 INFO [train.py:996] (2/4) Epoch 12, batch 6700, loss[loss=0.2191, simple_loss=0.3036, pruned_loss=0.06728, over 21099.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2728, pruned_loss=0.06069, over 4279556.84 frames. ], batch size: 608, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:09:21,026 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.89 vs. limit=22.5 2023-06-28 12:09:52,368 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.639e+02 7.163e+02 1.028e+03 1.473e+03 3.561e+03, threshold=2.056e+03, percent-clipped=9.0 2023-06-28 12:10:25,255 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 12:10:53,907 INFO [train.py:996] (2/4) Epoch 12, batch 6750, loss[loss=0.2058, simple_loss=0.2741, pruned_loss=0.06876, over 21805.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.272, pruned_loss=0.06107, over 4287098.72 frames. ], batch size: 102, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:11:04,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2053146.0, ans=0.125 2023-06-28 12:11:16,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2053206.0, ans=0.0 2023-06-28 12:11:24,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2053206.0, ans=0.0 2023-06-28 12:12:13,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2053386.0, ans=0.125 2023-06-28 12:12:33,674 INFO [train.py:996] (2/4) Epoch 12, batch 6800, loss[loss=0.1979, simple_loss=0.2649, pruned_loss=0.06546, over 21731.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2746, pruned_loss=0.06286, over 4278111.32 frames. 
], batch size: 333, lr: 2.45e-03, grad_scale: 32.0 2023-06-28 12:13:13,852 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.991e+02 6.929e+02 1.207e+03 2.029e+03 5.012e+03, threshold=2.414e+03, percent-clipped=24.0 2023-06-28 12:13:36,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2053626.0, ans=0.1 2023-06-28 12:13:52,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2053686.0, ans=0.0 2023-06-28 12:14:13,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2053746.0, ans=0.125 2023-06-28 12:14:14,443 INFO [train.py:996] (2/4) Epoch 12, batch 6850, loss[loss=0.2267, simple_loss=0.292, pruned_loss=0.08075, over 21427.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2726, pruned_loss=0.06387, over 4277626.57 frames. ], batch size: 159, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:14:20,587 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.63 vs. limit=15.0 2023-06-28 12:14:31,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2053806.0, ans=0.2 2023-06-28 12:14:41,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2053806.0, ans=0.125 2023-06-28 12:15:12,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2053866.0, ans=0.1 2023-06-28 12:15:31,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2053926.0, ans=0.125 2023-06-28 12:15:34,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2053926.0, ans=0.1 2023-06-28 12:15:37,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2053986.0, ans=0.035 2023-06-28 12:15:58,205 INFO [train.py:996] (2/4) Epoch 12, batch 6900, loss[loss=0.2127, simple_loss=0.2901, pruned_loss=0.06762, over 21787.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2729, pruned_loss=0.06377, over 4283238.81 frames. ], batch size: 441, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:16:39,823 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.832e+02 6.265e+02 8.270e+02 1.384e+03 3.220e+03, threshold=1.654e+03, percent-clipped=7.0 2023-06-28 12:16:53,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2054166.0, ans=0.05 2023-06-28 12:17:01,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2054226.0, ans=0.0 2023-06-28 12:17:05,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2054226.0, ans=0.2 2023-06-28 12:17:38,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2054286.0, ans=0.125 2023-06-28 12:17:45,889 INFO [train.py:996] (2/4) Epoch 12, batch 6950, loss[loss=0.215, simple_loss=0.2904, pruned_loss=0.06986, over 21211.00 frames. 
], tot_loss[loss=0.1983, simple_loss=0.2746, pruned_loss=0.06097, over 4272954.19 frames. ], batch size: 159, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:17:48,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2054346.0, ans=0.0 2023-06-28 12:17:59,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2054346.0, ans=0.0 2023-06-28 12:19:04,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2054526.0, ans=0.125 2023-06-28 12:19:28,451 INFO [train.py:996] (2/4) Epoch 12, batch 7000, loss[loss=0.194, simple_loss=0.2556, pruned_loss=0.0662, over 21241.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2778, pruned_loss=0.06287, over 4278498.75 frames. ], batch size: 159, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:19:40,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2054646.0, ans=0.0 2023-06-28 12:19:52,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2054706.0, ans=0.1 2023-06-28 12:20:05,474 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.352e+02 8.399e+02 1.085e+03 1.441e+03 2.628e+03, threshold=2.170e+03, percent-clipped=15.0 2023-06-28 12:20:46,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2054826.0, ans=0.125 2023-06-28 12:21:03,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=2054886.0, ans=15.0 2023-06-28 12:21:14,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2054946.0, ans=0.125 2023-06-28 12:21:16,087 INFO [train.py:996] (2/4) Epoch 12, batch 7050, loss[loss=0.2241, simple_loss=0.3104, pruned_loss=0.06886, over 21447.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2761, pruned_loss=0.06208, over 4275192.27 frames. ], batch size: 471, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:21:20,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2054946.0, ans=0.125 2023-06-28 12:21:25,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2054946.0, ans=0.1 2023-06-28 12:22:28,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2055126.0, ans=0.125 2023-06-28 12:22:57,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2055186.0, ans=0.0 2023-06-28 12:23:00,202 INFO [train.py:996] (2/4) Epoch 12, batch 7100, loss[loss=0.21, simple_loss=0.2939, pruned_loss=0.06311, over 21776.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2802, pruned_loss=0.06303, over 4276102.79 frames. 
], batch size: 124, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:23:33,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2055306.0, ans=0.2 2023-06-28 12:23:36,458 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.448e+02 7.505e+02 1.150e+03 1.796e+03 3.717e+03, threshold=2.300e+03, percent-clipped=14.0 2023-06-28 12:23:56,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2055366.0, ans=0.125 2023-06-28 12:24:36,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2055486.0, ans=0.07 2023-06-28 12:24:39,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2055486.0, ans=0.125 2023-06-28 12:24:42,354 INFO [train.py:996] (2/4) Epoch 12, batch 7150, loss[loss=0.1912, simple_loss=0.2769, pruned_loss=0.05273, over 21820.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2775, pruned_loss=0.0605, over 4274945.46 frames. ], batch size: 282, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:25:18,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2055606.0, ans=0.125 2023-06-28 12:26:08,049 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-28 12:26:25,287 INFO [train.py:996] (2/4) Epoch 12, batch 7200, loss[loss=0.1874, simple_loss=0.258, pruned_loss=0.05845, over 21630.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2836, pruned_loss=0.06369, over 4268004.27 frames. ], batch size: 298, lr: 2.45e-03, grad_scale: 32.0 2023-06-28 12:26:49,586 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=15.0 2023-06-28 12:26:50,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2055906.0, ans=0.125 2023-06-28 12:26:54,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2055906.0, ans=0.125 2023-06-28 12:27:08,262 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.252e+02 8.659e+02 1.185e+03 1.756e+03 3.819e+03, threshold=2.369e+03, percent-clipped=13.0 2023-06-28 12:27:11,240 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.72 vs. limit=15.0 2023-06-28 12:27:31,117 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.12 vs. limit=15.0 2023-06-28 12:27:50,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2056086.0, ans=0.125 2023-06-28 12:28:12,250 INFO [train.py:996] (2/4) Epoch 12, batch 7250, loss[loss=0.1703, simple_loss=0.2254, pruned_loss=0.05762, over 21330.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2787, pruned_loss=0.06412, over 4264223.49 frames. 
], batch size: 551, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:28:26,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2056146.0, ans=0.0 2023-06-28 12:28:35,120 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=22.5 2023-06-28 12:28:51,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2056266.0, ans=0.0 2023-06-28 12:29:32,568 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-06-28 12:29:44,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2056386.0, ans=0.0 2023-06-28 12:29:53,561 INFO [train.py:996] (2/4) Epoch 12, batch 7300, loss[loss=0.1793, simple_loss=0.2437, pruned_loss=0.05744, over 21460.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2737, pruned_loss=0.06392, over 4268791.23 frames. ], batch size: 195, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:30:05,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2056446.0, ans=0.125 2023-06-28 12:30:31,643 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.740e+02 7.948e+02 1.183e+03 1.586e+03 3.750e+03, threshold=2.367e+03, percent-clipped=12.0 2023-06-28 12:30:47,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2056626.0, ans=0.0 2023-06-28 12:31:02,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2056626.0, ans=0.125 2023-06-28 12:31:02,265 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 12:31:04,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2056626.0, ans=0.125 2023-06-28 12:31:27,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2056686.0, ans=0.125 2023-06-28 12:31:27,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2056686.0, ans=0.1 2023-06-28 12:31:31,966 INFO [train.py:996] (2/4) Epoch 12, batch 7350, loss[loss=0.2201, simple_loss=0.2959, pruned_loss=0.07215, over 21710.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2716, pruned_loss=0.06419, over 4270543.38 frames. ], batch size: 332, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:31:43,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2056746.0, ans=0.2 2023-06-28 12:32:02,154 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.74 vs. limit=15.0 2023-06-28 12:32:11,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2056866.0, ans=0.0 2023-06-28 12:32:41,281 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.05 vs. 
limit=15.0 2023-06-28 12:32:59,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2056986.0, ans=0.0 2023-06-28 12:33:09,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2056986.0, ans=0.5 2023-06-28 12:33:11,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2056986.0, ans=0.04949747468305833 2023-06-28 12:33:17,325 INFO [train.py:996] (2/4) Epoch 12, batch 7400, loss[loss=0.2405, simple_loss=0.3296, pruned_loss=0.07569, over 21458.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2782, pruned_loss=0.06548, over 4269283.31 frames. ], batch size: 471, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:33:22,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2057046.0, ans=0.125 2023-06-28 12:33:29,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2057046.0, ans=0.025 2023-06-28 12:33:53,537 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=22.5 2023-06-28 12:33:56,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2057106.0, ans=0.125 2023-06-28 12:34:05,872 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.109e+02 7.290e+02 9.953e+02 1.415e+03 2.956e+03, threshold=1.991e+03, percent-clipped=1.0 2023-06-28 12:35:00,569 INFO [train.py:996] (2/4) Epoch 12, batch 7450, loss[loss=0.183, simple_loss=0.2576, pruned_loss=0.05414, over 21669.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2791, pruned_loss=0.06511, over 4270889.85 frames. ], batch size: 333, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:35:19,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2057346.0, ans=0.0 2023-06-28 12:35:21,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2057406.0, ans=0.0 2023-06-28 12:36:49,976 INFO [train.py:996] (2/4) Epoch 12, batch 7500, loss[loss=0.2706, simple_loss=0.3724, pruned_loss=0.08435, over 21640.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2834, pruned_loss=0.06646, over 4269761.78 frames. ], batch size: 414, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:37:33,886 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.996e+02 7.365e+02 1.053e+03 1.699e+03 4.084e+03, threshold=2.105e+03, percent-clipped=21.0 2023-06-28 12:37:38,665 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.48 vs. limit=15.0 2023-06-28 12:38:21,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2057886.0, ans=0.04949747468305833 2023-06-28 12:38:34,119 INFO [train.py:996] (2/4) Epoch 12, batch 7550, loss[loss=0.2086, simple_loss=0.3098, pruned_loss=0.05367, over 21656.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2902, pruned_loss=0.06514, over 4272415.69 frames. 
], batch size: 414, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:38:34,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2057946.0, ans=0.125 2023-06-28 12:39:34,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2058066.0, ans=0.1 2023-06-28 12:39:51,485 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.05 vs. limit=15.0 2023-06-28 12:39:59,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2058186.0, ans=0.1 2023-06-28 12:40:16,307 INFO [train.py:996] (2/4) Epoch 12, batch 7600, loss[loss=0.215, simple_loss=0.2976, pruned_loss=0.06617, over 22070.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2878, pruned_loss=0.06413, over 4277075.81 frames. ], batch size: 119, lr: 2.45e-03, grad_scale: 32.0 2023-06-28 12:40:18,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2058246.0, ans=0.125 2023-06-28 12:40:26,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2058246.0, ans=0.125 2023-06-28 12:40:54,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2058306.0, ans=0.0 2023-06-28 12:40:58,888 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.988e+02 7.986e+02 1.163e+03 1.762e+03 3.955e+03, threshold=2.326e+03, percent-clipped=12.0 2023-06-28 12:41:21,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2058426.0, ans=0.2 2023-06-28 12:41:57,889 INFO [train.py:996] (2/4) Epoch 12, batch 7650, loss[loss=0.2043, simple_loss=0.2731, pruned_loss=0.06774, over 22022.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2872, pruned_loss=0.06523, over 4280577.98 frames. ], batch size: 300, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:42:38,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2058606.0, ans=0.1 2023-06-28 12:43:44,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=2058786.0, ans=15.0 2023-06-28 12:43:46,554 INFO [train.py:996] (2/4) Epoch 12, batch 7700, loss[loss=0.1978, simple_loss=0.2557, pruned_loss=0.06989, over 20173.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2897, pruned_loss=0.06754, over 4287027.37 frames. 
], batch size: 702, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:44:09,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2058906.0, ans=0.1 2023-06-28 12:44:09,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2058906.0, ans=0.125 2023-06-28 12:44:27,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2058906.0, ans=0.04949747468305833 2023-06-28 12:44:31,977 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.981e+02 7.414e+02 1.157e+03 1.590e+03 5.387e+03, threshold=2.314e+03, percent-clipped=8.0 2023-06-28 12:44:45,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2058966.0, ans=0.1 2023-06-28 12:45:36,624 INFO [train.py:996] (2/4) Epoch 12, batch 7750, loss[loss=0.2464, simple_loss=0.3488, pruned_loss=0.07201, over 21739.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2953, pruned_loss=0.06734, over 4287207.23 frames. ], batch size: 351, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:45:48,373 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.62 vs. limit=15.0 2023-06-28 12:46:06,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2059206.0, ans=0.125 2023-06-28 12:46:30,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2059266.0, ans=0.0 2023-06-28 12:46:39,207 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.61 vs. limit=10.0 2023-06-28 12:47:21,175 INFO [train.py:996] (2/4) Epoch 12, batch 7800, loss[loss=0.2239, simple_loss=0.3534, pruned_loss=0.04723, over 19792.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2965, pruned_loss=0.06746, over 4274547.76 frames. ], batch size: 702, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:47:31,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2059446.0, ans=0.125 2023-06-28 12:47:40,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2059506.0, ans=0.125 2023-06-28 12:47:43,109 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=22.5 2023-06-28 12:47:52,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2059506.0, ans=0.0 2023-06-28 12:48:00,037 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.542e+02 9.199e+02 1.440e+03 2.477e+03 5.669e+03, threshold=2.881e+03, percent-clipped=30.0 2023-06-28 12:48:03,059 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.50 vs. 
limit=22.5 2023-06-28 12:48:03,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2059566.0, ans=0.2 2023-06-28 12:48:05,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2059566.0, ans=0.125 2023-06-28 12:48:16,581 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=22.5 2023-06-28 12:48:55,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2059686.0, ans=0.125 2023-06-28 12:49:03,637 INFO [train.py:996] (2/4) Epoch 12, batch 7850, loss[loss=0.2023, simple_loss=0.2684, pruned_loss=0.06812, over 21557.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2916, pruned_loss=0.06695, over 4263747.91 frames. ], batch size: 391, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:49:09,407 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 12:49:36,310 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 12:50:07,243 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.43 vs. limit=15.0 2023-06-28 12:50:48,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2060046.0, ans=0.0 2023-06-28 12:50:49,207 INFO [train.py:996] (2/4) Epoch 12, batch 7900, loss[loss=0.1902, simple_loss=0.252, pruned_loss=0.06419, over 21460.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2867, pruned_loss=0.06582, over 4262768.34 frames. ], batch size: 212, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:51:12,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2060106.0, ans=0.125 2023-06-28 12:51:30,563 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.553e+02 9.216e+02 1.431e+03 2.035e+03 3.808e+03, threshold=2.862e+03, percent-clipped=8.0 2023-06-28 12:51:53,337 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.93 vs. limit=6.0 2023-06-28 12:52:07,820 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.26 vs. limit=10.0 2023-06-28 12:52:14,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2060226.0, ans=0.1 2023-06-28 12:52:15,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2060286.0, ans=0.2 2023-06-28 12:52:38,423 INFO [train.py:996] (2/4) Epoch 12, batch 7950, loss[loss=0.2082, simple_loss=0.2943, pruned_loss=0.06107, over 21813.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2902, pruned_loss=0.0655, over 4266094.69 frames. 
], batch size: 247, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:52:43,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2060346.0, ans=0.0 2023-06-28 12:53:44,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2060526.0, ans=0.125 2023-06-28 12:54:24,575 INFO [train.py:996] (2/4) Epoch 12, batch 8000, loss[loss=0.2388, simple_loss=0.3347, pruned_loss=0.07147, over 21642.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2959, pruned_loss=0.06822, over 4265533.04 frames. ], batch size: 414, lr: 2.44e-03, grad_scale: 32.0 2023-06-28 12:54:37,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2060646.0, ans=0.125 2023-06-28 12:55:18,765 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.842e+02 9.882e+02 1.672e+03 2.798e+03 5.114e+03, threshold=3.344e+03, percent-clipped=23.0 2023-06-28 12:56:16,344 INFO [train.py:996] (2/4) Epoch 12, batch 8050, loss[loss=0.176, simple_loss=0.2352, pruned_loss=0.05842, over 21430.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2988, pruned_loss=0.06838, over 4261697.40 frames. ], batch size: 131, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:56:55,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2061006.0, ans=0.125 2023-06-28 12:57:24,115 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=12.0 2023-06-28 12:58:04,709 INFO [train.py:996] (2/4) Epoch 12, batch 8100, loss[loss=0.1971, simple_loss=0.2751, pruned_loss=0.05961, over 21929.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2968, pruned_loss=0.06909, over 4267462.88 frames. ], batch size: 316, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:58:32,379 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0 2023-06-28 12:58:53,298 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.830e+02 7.832e+02 1.202e+03 2.450e+03 5.574e+03, threshold=2.405e+03, percent-clipped=10.0 2023-06-28 12:59:35,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2061486.0, ans=0.0 2023-06-28 12:59:45,801 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.23 vs. limit=15.0 2023-06-28 12:59:56,695 INFO [train.py:996] (2/4) Epoch 12, batch 8150, loss[loss=0.2669, simple_loss=0.3779, pruned_loss=0.07794, over 21572.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3073, pruned_loss=0.07058, over 4269419.96 frames. ], batch size: 441, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:00:29,449 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.64 vs. 
limit=15.0 2023-06-28 13:00:43,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2061666.0, ans=0.0 2023-06-28 13:00:53,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2061666.0, ans=0.0 2023-06-28 13:01:39,551 INFO [train.py:996] (2/4) Epoch 12, batch 8200, loss[loss=0.1688, simple_loss=0.2427, pruned_loss=0.04742, over 21364.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2985, pruned_loss=0.06833, over 4257425.91 frames. ], batch size: 131, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:01:48,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=2061846.0, ans=0.1 2023-06-28 13:02:21,472 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.733e+02 7.541e+02 1.166e+03 1.975e+03 4.840e+03, threshold=2.333e+03, percent-clipped=18.0 2023-06-28 13:02:49,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2062026.0, ans=0.1 2023-06-28 13:03:05,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2062086.0, ans=0.0 2023-06-28 13:03:23,733 INFO [train.py:996] (2/4) Epoch 12, batch 8250, loss[loss=0.2128, simple_loss=0.3084, pruned_loss=0.05859, over 21382.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2976, pruned_loss=0.06779, over 4263134.20 frames. ], batch size: 194, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:03:46,959 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.06 vs. limit=15.0 2023-06-28 13:04:55,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2062386.0, ans=0.0 2023-06-28 13:04:56,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2062386.0, ans=0.0 2023-06-28 13:05:01,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2062386.0, ans=0.125 2023-06-28 13:05:07,865 INFO [train.py:996] (2/4) Epoch 12, batch 8300, loss[loss=0.2069, simple_loss=0.2951, pruned_loss=0.05934, over 21812.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2954, pruned_loss=0.06537, over 4263953.31 frames. ], batch size: 371, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:05:29,274 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.35 vs. limit=15.0 2023-06-28 13:05:49,522 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.960e+02 7.792e+02 1.211e+03 1.944e+03 6.178e+03, threshold=2.421e+03, percent-clipped=18.0 2023-06-28 13:06:55,852 INFO [train.py:996] (2/4) Epoch 12, batch 8350, loss[loss=0.2053, simple_loss=0.2884, pruned_loss=0.06108, over 20027.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2952, pruned_loss=0.0633, over 4261761.73 frames. 
], batch size: 703, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:07:20,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2062806.0, ans=0.125 2023-06-28 13:07:31,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2062866.0, ans=0.2 2023-06-28 13:08:15,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2062926.0, ans=0.125 2023-06-28 13:08:32,687 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=15.0 2023-06-28 13:08:39,736 INFO [train.py:996] (2/4) Epoch 12, batch 8400, loss[loss=0.1776, simple_loss=0.2599, pruned_loss=0.04765, over 21432.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2929, pruned_loss=0.06209, over 4249169.00 frames. ], batch size: 131, lr: 2.44e-03, grad_scale: 32.0 2023-06-28 13:09:01,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2063106.0, ans=0.2 2023-06-28 13:09:06,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2063106.0, ans=0.04949747468305833 2023-06-28 13:09:21,799 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.051e+02 6.739e+02 1.036e+03 1.500e+03 3.619e+03, threshold=2.071e+03, percent-clipped=10.0 2023-06-28 13:10:00,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2063286.0, ans=0.125 2023-06-28 13:10:21,248 INFO [train.py:996] (2/4) Epoch 12, batch 8450, loss[loss=0.2054, simple_loss=0.2757, pruned_loss=0.06757, over 21755.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2901, pruned_loss=0.06116, over 4254624.42 frames. ], batch size: 298, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:11:06,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2063466.0, ans=0.1 2023-06-28 13:11:38,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2063526.0, ans=0.1 2023-06-28 13:11:40,388 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.83 vs. limit=15.0 2023-06-28 13:11:56,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2063586.0, ans=0.0 2023-06-28 13:12:04,196 INFO [train.py:996] (2/4) Epoch 12, batch 8500, loss[loss=0.1959, simple_loss=0.2635, pruned_loss=0.06415, over 21760.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2854, pruned_loss=0.06215, over 4264395.13 frames. 
], batch size: 316, lr: 2.44e-03, grad_scale: 8.0 2023-06-28 13:12:08,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2063646.0, ans=0.07 2023-06-28 13:12:25,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2063706.0, ans=0.0 2023-06-28 13:12:49,780 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.731e+02 8.144e+02 1.139e+03 1.907e+03 5.140e+03, threshold=2.279e+03, percent-clipped=18.0 2023-06-28 13:13:13,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2063826.0, ans=0.0 2023-06-28 13:13:42,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2063886.0, ans=0.0 2023-06-28 13:13:48,481 INFO [train.py:996] (2/4) Epoch 12, batch 8550, loss[loss=0.2385, simple_loss=0.3354, pruned_loss=0.07078, over 21626.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2902, pruned_loss=0.06458, over 4266412.63 frames. ], batch size: 389, lr: 2.44e-03, grad_scale: 8.0 2023-06-28 13:13:50,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2063946.0, ans=0.1 2023-06-28 13:14:03,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2063946.0, ans=0.0 2023-06-28 13:15:17,094 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 13:15:30,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2064186.0, ans=0.125 2023-06-28 13:15:34,952 INFO [train.py:996] (2/4) Epoch 12, batch 8600, loss[loss=0.2356, simple_loss=0.3162, pruned_loss=0.07746, over 21711.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2961, pruned_loss=0.06652, over 4268717.82 frames. ], batch size: 298, lr: 2.44e-03, grad_scale: 8.0 2023-06-28 13:16:20,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2064366.0, ans=0.125 2023-06-28 13:16:29,851 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.562e+02 1.076e+03 1.611e+03 2.403e+03 4.318e+03, threshold=3.223e+03, percent-clipped=30.0 2023-06-28 13:16:35,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2064366.0, ans=0.125 2023-06-28 13:16:36,165 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=22.5 2023-06-28 13:17:04,373 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=22.5 2023-06-28 13:17:18,552 INFO [train.py:996] (2/4) Epoch 12, batch 8650, loss[loss=0.2722, simple_loss=0.3591, pruned_loss=0.09268, over 21469.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.3014, pruned_loss=0.06797, over 4269940.80 frames. 
], batch size: 507, lr: 2.44e-03, grad_scale: 8.0 2023-06-28 13:18:19,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2064666.0, ans=0.2 2023-06-28 13:18:59,815 INFO [train.py:996] (2/4) Epoch 12, batch 8700, loss[loss=0.1942, simple_loss=0.2676, pruned_loss=0.06043, over 21867.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2936, pruned_loss=0.06509, over 4270204.65 frames. ], batch size: 107, lr: 2.44e-03, grad_scale: 8.0 2023-06-28 13:19:19,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2064906.0, ans=0.125 2023-06-28 13:19:53,460 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.697e+02 7.863e+02 1.211e+03 1.985e+03 4.359e+03, threshold=2.422e+03, percent-clipped=4.0 2023-06-28 13:20:31,110 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=15.0 2023-06-28 13:20:35,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2065086.0, ans=0.125 2023-06-28 13:20:41,881 INFO [train.py:996] (2/4) Epoch 12, batch 8750, loss[loss=0.1874, simple_loss=0.2626, pruned_loss=0.05606, over 21986.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2887, pruned_loss=0.06497, over 4274680.65 frames. ], batch size: 103, lr: 2.44e-03, grad_scale: 8.0 2023-06-28 13:21:01,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2065146.0, ans=0.0 2023-06-28 13:21:05,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2065206.0, ans=0.125 2023-06-28 13:21:40,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2065266.0, ans=0.125 2023-06-28 13:22:20,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2065386.0, ans=0.1 2023-06-28 13:22:31,075 INFO [train.py:996] (2/4) Epoch 12, batch 8800, loss[loss=0.248, simple_loss=0.327, pruned_loss=0.08451, over 21537.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2953, pruned_loss=0.06681, over 4280165.65 frames. ], batch size: 194, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:22:33,982 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2023-06-28 13:22:55,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2065506.0, ans=0.2 2023-06-28 13:23:15,116 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-28 13:23:26,839 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.165e+02 8.763e+02 1.222e+03 1.735e+03 3.559e+03, threshold=2.444e+03, percent-clipped=10.0 2023-06-28 13:23:47,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2065626.0, ans=0.0 2023-06-28 13:24:15,728 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.93 vs. 
limit=22.5 2023-06-28 13:24:16,137 INFO [train.py:996] (2/4) Epoch 12, batch 8850, loss[loss=0.2331, simple_loss=0.3039, pruned_loss=0.08113, over 21348.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.3004, pruned_loss=0.06815, over 4267941.85 frames. ], batch size: 471, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:24:26,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2065746.0, ans=0.0 2023-06-28 13:24:26,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2065746.0, ans=0.125 2023-06-28 13:24:31,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2065746.0, ans=0.0 2023-06-28 13:25:01,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2065806.0, ans=0.2 2023-06-28 13:25:19,098 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.65 vs. limit=5.0 2023-06-28 13:25:21,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2065926.0, ans=0.125 2023-06-28 13:25:55,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2065986.0, ans=0.125 2023-06-28 13:26:05,235 INFO [train.py:996] (2/4) Epoch 12, batch 8900, loss[loss=0.1986, simple_loss=0.2673, pruned_loss=0.06492, over 21594.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2954, pruned_loss=0.06739, over 4270205.07 frames. ], batch size: 415, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:26:57,497 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.266e+02 7.347e+02 1.235e+03 1.790e+03 4.739e+03, threshold=2.470e+03, percent-clipped=10.0 2023-06-28 13:27:56,321 INFO [train.py:996] (2/4) Epoch 12, batch 8950, loss[loss=0.1942, simple_loss=0.2665, pruned_loss=0.06097, over 21834.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2971, pruned_loss=0.0666, over 4275200.67 frames. ], batch size: 98, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:28:19,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2066406.0, ans=0.1 2023-06-28 13:28:20,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2066406.0, ans=0.1 2023-06-28 13:28:38,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2066466.0, ans=0.07 2023-06-28 13:29:27,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2066586.0, ans=0.125 2023-06-28 13:29:38,977 INFO [train.py:996] (2/4) Epoch 12, batch 9000, loss[loss=0.188, simple_loss=0.2557, pruned_loss=0.06018, over 21672.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2919, pruned_loss=0.06681, over 4271261.72 frames. ], batch size: 248, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:29:38,977 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-28 13:29:59,527 INFO [train.py:1028] (2/4) Epoch 12, validation: loss=0.2628, simple_loss=0.3535, pruned_loss=0.086, over 1796401.00 frames. 
2023-06-28 13:29:59,528 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23793MB 2023-06-28 13:30:06,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2066646.0, ans=0.125 2023-06-28 13:30:44,982 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.661e+02 7.055e+02 9.403e+02 1.588e+03 4.919e+03, threshold=1.881e+03, percent-clipped=11.0 2023-06-28 13:30:51,495 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-28 13:30:54,619 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2023-06-28 13:31:09,666 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 13:31:18,862 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0 2023-06-28 13:31:34,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2066886.0, ans=0.125 2023-06-28 13:31:44,388 INFO [train.py:996] (2/4) Epoch 12, batch 9050, loss[loss=0.2115, simple_loss=0.288, pruned_loss=0.06749, over 21584.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2876, pruned_loss=0.06335, over 4276231.30 frames. ], batch size: 263, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:31:54,213 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-28 13:33:22,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2067186.0, ans=0.125 2023-06-28 13:33:30,610 INFO [train.py:996] (2/4) Epoch 12, batch 9100, loss[loss=0.2064, simple_loss=0.3089, pruned_loss=0.05192, over 21682.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2931, pruned_loss=0.06558, over 4286329.31 frames. ], batch size: 351, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:34:04,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2067306.0, ans=0.125 2023-06-28 13:34:22,066 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.166e+02 1.280e+03 2.185e+03 3.198e+03 4.785e+03, threshold=4.371e+03, percent-clipped=55.0 2023-06-28 13:35:16,206 INFO [train.py:996] (2/4) Epoch 12, batch 9150, loss[loss=0.1961, simple_loss=0.2898, pruned_loss=0.05118, over 21736.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2988, pruned_loss=0.06412, over 4286059.76 frames. ], batch size: 247, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:35:55,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2067606.0, ans=0.09899494936611666 2023-06-28 13:36:13,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2067666.0, ans=0.0 2023-06-28 13:36:59,438 INFO [train.py:996] (2/4) Epoch 12, batch 9200, loss[loss=0.3083, simple_loss=0.3716, pruned_loss=0.1225, over 21354.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.3007, pruned_loss=0.06353, over 4287847.01 frames. 
], batch size: 507, lr: 2.44e-03, grad_scale: 32.0 2023-06-28 13:37:01,052 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.51 vs. limit=15.0 2023-06-28 13:38:01,133 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.761e+02 9.017e+02 1.569e+03 2.101e+03 3.767e+03, threshold=3.138e+03, percent-clipped=0.0 2023-06-28 13:38:25,927 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.20 vs. limit=6.0 2023-06-28 13:38:35,083 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-28 13:38:48,656 INFO [train.py:996] (2/4) Epoch 12, batch 9250, loss[loss=0.1853, simple_loss=0.268, pruned_loss=0.05128, over 20767.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.3019, pruned_loss=0.06598, over 4289191.13 frames. ], batch size: 607, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:40:01,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2068326.0, ans=0.125 2023-06-28 13:40:28,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2068386.0, ans=0.125 2023-06-28 13:40:39,790 INFO [train.py:996] (2/4) Epoch 12, batch 9300, loss[loss=0.2529, simple_loss=0.3131, pruned_loss=0.09631, over 21275.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2981, pruned_loss=0.06542, over 4273845.36 frames. ], batch size: 471, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:40:42,511 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.95 vs. limit=22.5 2023-06-28 13:41:32,625 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.166e+02 1.033e+03 1.685e+03 2.661e+03 5.053e+03, threshold=3.371e+03, percent-clipped=15.0 2023-06-28 13:42:25,451 INFO [train.py:996] (2/4) Epoch 12, batch 9350, loss[loss=0.2157, simple_loss=0.3049, pruned_loss=0.06324, over 21428.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.3024, pruned_loss=0.06608, over 4277367.23 frames. ], batch size: 131, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:42:51,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2068806.0, ans=0.0 2023-06-28 13:43:12,189 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.37 vs. 
limit=15.0 2023-06-28 13:43:25,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2068926.0, ans=0.125 2023-06-28 13:43:47,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2068926.0, ans=0.125 2023-06-28 13:43:49,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=2068926.0, ans=0.2 2023-06-28 13:43:49,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2068926.0, ans=0.0 2023-06-28 13:43:54,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2068986.0, ans=0.125 2023-06-28 13:43:55,305 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.88 vs. limit=5.0 2023-06-28 13:44:09,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2069046.0, ans=0.125 2023-06-28 13:44:15,566 INFO [train.py:996] (2/4) Epoch 12, batch 9400, loss[loss=0.1933, simple_loss=0.2661, pruned_loss=0.06031, over 21764.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.3023, pruned_loss=0.06632, over 4277520.21 frames. ], batch size: 351, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:44:31,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2069046.0, ans=0.0 2023-06-28 13:44:52,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2069166.0, ans=0.04949747468305833 2023-06-28 13:44:52,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2069166.0, ans=0.125 2023-06-28 13:45:01,445 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.886e+02 7.931e+02 1.125e+03 1.716e+03 3.605e+03, threshold=2.249e+03, percent-clipped=1.0 2023-06-28 13:45:20,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2069226.0, ans=0.125 2023-06-28 13:45:47,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2069286.0, ans=0.0 2023-06-28 13:45:58,263 INFO [train.py:996] (2/4) Epoch 12, batch 9450, loss[loss=0.1869, simple_loss=0.2621, pruned_loss=0.05583, over 21756.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2965, pruned_loss=0.06581, over 4255738.80 frames. ], batch size: 118, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:47:06,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2069526.0, ans=0.0 2023-06-28 13:47:06,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2069526.0, ans=0.125 2023-06-28 13:47:41,508 INFO [train.py:996] (2/4) Epoch 12, batch 9500, loss[loss=0.1743, simple_loss=0.2431, pruned_loss=0.05274, over 21768.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2887, pruned_loss=0.06409, over 4264741.85 frames. 
], batch size: 317, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:47:55,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2069646.0, ans=0.1 2023-06-28 13:48:00,703 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 13:48:08,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2069706.0, ans=0.125 2023-06-28 13:48:19,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=2069766.0, ans=0.025 2023-06-28 13:48:21,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2069766.0, ans=0.2 2023-06-28 13:48:38,316 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.464e+02 8.117e+02 1.177e+03 1.570e+03 4.123e+03, threshold=2.354e+03, percent-clipped=16.0 2023-06-28 13:48:40,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2069766.0, ans=0.2 2023-06-28 13:48:55,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2069826.0, ans=0.125 2023-06-28 13:49:08,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2069886.0, ans=0.1 2023-06-28 13:49:25,117 INFO [train.py:996] (2/4) Epoch 12, batch 9550, loss[loss=0.2511, simple_loss=0.3378, pruned_loss=0.08222, over 21820.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2906, pruned_loss=0.06531, over 4260490.21 frames. ], batch size: 124, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:50:09,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2070066.0, ans=0.0 2023-06-28 13:50:21,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2070066.0, ans=0.125 2023-06-28 13:50:40,512 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.86 vs. limit=22.5 2023-06-28 13:50:58,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=2070186.0, ans=0.025 2023-06-28 13:51:04,176 INFO [train.py:996] (2/4) Epoch 12, batch 9600, loss[loss=0.2057, simple_loss=0.2819, pruned_loss=0.06477, over 21775.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2931, pruned_loss=0.06689, over 4272846.33 frames. ], batch size: 389, lr: 2.44e-03, grad_scale: 32.0 2023-06-28 13:52:01,563 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.070e+02 8.059e+02 1.139e+03 1.979e+03 4.989e+03, threshold=2.277e+03, percent-clipped=18.0 2023-06-28 13:52:12,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2070426.0, ans=0.125 2023-06-28 13:52:51,595 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. 
limit=6.0 2023-06-28 13:52:52,034 INFO [train.py:996] (2/4) Epoch 12, batch 9650, loss[loss=0.2445, simple_loss=0.3314, pruned_loss=0.07879, over 21804.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.294, pruned_loss=0.06729, over 4277642.80 frames. ], batch size: 124, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:53:34,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2070666.0, ans=0.125 2023-06-28 13:53:52,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2070726.0, ans=0.125 2023-06-28 13:53:52,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2070726.0, ans=0.125 2023-06-28 13:54:36,728 INFO [train.py:996] (2/4) Epoch 12, batch 9700, loss[loss=0.1971, simple_loss=0.2754, pruned_loss=0.05945, over 21719.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2973, pruned_loss=0.06789, over 4274083.54 frames. ], batch size: 247, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:55:29,995 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.817e+02 8.034e+02 1.157e+03 1.856e+03 3.207e+03, threshold=2.314e+03, percent-clipped=13.0 2023-06-28 13:55:41,575 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=15.0 2023-06-28 13:56:19,106 INFO [train.py:996] (2/4) Epoch 12, batch 9750, loss[loss=0.207, simple_loss=0.2837, pruned_loss=0.06518, over 21390.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2926, pruned_loss=0.06655, over 4277090.48 frames. ], batch size: 131, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:57:07,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2071266.0, ans=0.125 2023-06-28 13:57:09,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2071266.0, ans=0.0 2023-06-28 13:57:20,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2071326.0, ans=0.0 2023-06-28 13:58:01,398 INFO [train.py:996] (2/4) Epoch 12, batch 9800, loss[loss=0.1788, simple_loss=0.2276, pruned_loss=0.06501, over 20755.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2911, pruned_loss=0.06685, over 4281090.64 frames. ], batch size: 609, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:58:16,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2071506.0, ans=0.0 2023-06-28 13:58:17,465 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.04 vs. 
limit=22.5 2023-06-28 13:58:51,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2071566.0, ans=0.2 2023-06-28 13:58:54,275 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.350e+02 9.272e+02 1.641e+03 2.423e+03 5.120e+03, threshold=3.282e+03, percent-clipped=25.0 2023-06-28 13:59:24,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2071686.0, ans=0.125 2023-06-28 13:59:37,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2071686.0, ans=0.125 2023-06-28 13:59:43,789 INFO [train.py:996] (2/4) Epoch 12, batch 9850, loss[loss=0.1829, simple_loss=0.254, pruned_loss=0.05591, over 21801.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2873, pruned_loss=0.06661, over 4281066.56 frames. ], batch size: 351, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:59:52,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2071746.0, ans=0.2 2023-06-28 14:00:16,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2071806.0, ans=0.1 2023-06-28 14:00:53,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=2071926.0, ans=15.0 2023-06-28 14:00:53,742 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=15.0 2023-06-28 14:00:55,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2071926.0, ans=0.0 2023-06-28 14:01:03,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2071986.0, ans=0.0 2023-06-28 14:01:25,451 INFO [train.py:996] (2/4) Epoch 12, batch 9900, loss[loss=0.2445, simple_loss=0.3216, pruned_loss=0.0837, over 21561.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2849, pruned_loss=0.06651, over 4275567.21 frames. ], batch size: 414, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:01:33,552 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=22.5 2023-06-28 14:01:39,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2072046.0, ans=0.125 2023-06-28 14:02:19,781 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.295e+02 1.063e+03 1.503e+03 2.102e+03 4.753e+03, threshold=3.006e+03, percent-clipped=10.0 2023-06-28 14:02:38,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2072226.0, ans=0.0 2023-06-28 14:03:09,602 INFO [train.py:996] (2/4) Epoch 12, batch 9950, loss[loss=0.1907, simple_loss=0.2558, pruned_loss=0.06284, over 21579.00 frames. ], tot_loss[loss=0.21, simple_loss=0.285, pruned_loss=0.06751, over 4254200.92 frames. 
], batch size: 231, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:04:10,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2072526.0, ans=0.125 2023-06-28 14:04:21,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2072526.0, ans=0.2 2023-06-28 14:04:26,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2072526.0, ans=0.0 2023-06-28 14:04:27,448 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=12.0 2023-06-28 14:04:50,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2072586.0, ans=0.1 2023-06-28 14:04:52,802 INFO [train.py:996] (2/4) Epoch 12, batch 10000, loss[loss=0.2236, simple_loss=0.2834, pruned_loss=0.08196, over 21528.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2827, pruned_loss=0.06721, over 4264946.09 frames. ], batch size: 441, lr: 2.44e-03, grad_scale: 32.0 2023-06-28 14:05:35,306 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.68 vs. limit=12.0 2023-06-28 14:05:44,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2072766.0, ans=0.0 2023-06-28 14:05:50,334 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.893e+02 6.803e+02 1.015e+03 1.604e+03 3.420e+03, threshold=2.029e+03, percent-clipped=1.0 2023-06-28 14:06:36,075 INFO [train.py:996] (2/4) Epoch 12, batch 10050, loss[loss=0.1831, simple_loss=0.2526, pruned_loss=0.05682, over 21335.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2852, pruned_loss=0.06824, over 4264901.88 frames. ], batch size: 194, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:07:08,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2073006.0, ans=0.125 2023-06-28 14:07:42,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2073126.0, ans=0.0 2023-06-28 14:08:10,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2073186.0, ans=0.125 2023-06-28 14:08:13,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2073186.0, ans=0.0 2023-06-28 14:08:17,334 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=22.5 2023-06-28 14:08:21,354 INFO [train.py:996] (2/4) Epoch 12, batch 10100, loss[loss=0.1474, simple_loss=0.2148, pruned_loss=0.04002, over 15671.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2835, pruned_loss=0.06599, over 4265567.70 frames. ], batch size: 60, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:08:22,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2073246.0, ans=0.125 2023-06-28 14:08:34,133 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. 
limit=6.0 2023-06-28 14:08:52,543 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0 2023-06-28 14:08:55,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2073306.0, ans=0.125 2023-06-28 14:09:11,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2073366.0, ans=0.125 2023-06-28 14:09:21,187 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.696e+02 9.806e+02 1.615e+03 2.401e+03 4.786e+03, threshold=3.230e+03, percent-clipped=36.0 2023-06-28 14:09:39,238 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.21 vs. limit=6.0 2023-06-28 14:10:10,008 INFO [train.py:996] (2/4) Epoch 12, batch 10150, loss[loss=0.2043, simple_loss=0.2784, pruned_loss=0.06514, over 21790.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2882, pruned_loss=0.06755, over 4260235.87 frames. ], batch size: 371, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:10:23,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2073546.0, ans=0.04949747468305833 2023-06-28 14:10:29,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2073606.0, ans=0.125 2023-06-28 14:10:30,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2073606.0, ans=0.1 2023-06-28 14:10:50,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2073606.0, ans=0.2 2023-06-28 14:10:50,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2073606.0, ans=0.1 2023-06-28 14:11:07,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2073666.0, ans=0.04949747468305833 2023-06-28 14:11:30,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2073786.0, ans=0.0 2023-06-28 14:11:52,844 INFO [train.py:996] (2/4) Epoch 12, batch 10200, loss[loss=0.2295, simple_loss=0.3121, pruned_loss=0.07346, over 21564.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.289, pruned_loss=0.06607, over 4263664.33 frames. 
], batch size: 441, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:11:57,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2073846.0, ans=0.125 2023-06-28 14:12:08,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2073846.0, ans=0.0 2023-06-28 14:12:18,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2073906.0, ans=0.2 2023-06-28 14:12:47,766 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.411e+02 8.616e+02 1.269e+03 2.043e+03 3.610e+03, threshold=2.539e+03, percent-clipped=1.0 2023-06-28 14:13:11,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2074026.0, ans=0.1 2023-06-28 14:13:16,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2074086.0, ans=0.0 2023-06-28 14:13:18,907 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=22.5 2023-06-28 14:13:40,919 INFO [train.py:996] (2/4) Epoch 12, batch 10250, loss[loss=0.1563, simple_loss=0.2517, pruned_loss=0.03046, over 21657.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2837, pruned_loss=0.06079, over 4263559.80 frames. ], batch size: 414, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:13:41,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2074146.0, ans=0.0 2023-06-28 14:14:10,552 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=15.0 2023-06-28 14:14:21,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2074266.0, ans=0.125 2023-06-28 14:15:25,090 INFO [train.py:996] (2/4) Epoch 12, batch 10300, loss[loss=0.2384, simple_loss=0.3345, pruned_loss=0.07117, over 21784.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2853, pruned_loss=0.06181, over 4268197.04 frames. ], batch size: 332, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:15:27,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2074446.0, ans=0.07 2023-06-28 14:15:38,762 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.18 vs. 
limit=10.0 2023-06-28 14:15:58,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2074506.0, ans=0.125 2023-06-28 14:16:17,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2074566.0, ans=0.2 2023-06-28 14:16:21,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2074566.0, ans=0.1 2023-06-28 14:16:22,425 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.719e+02 6.981e+02 1.162e+03 1.847e+03 5.403e+03, threshold=2.324e+03, percent-clipped=10.0 2023-06-28 14:16:23,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2074566.0, ans=0.0 2023-06-28 14:16:31,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2074626.0, ans=0.1 2023-06-28 14:17:11,848 INFO [train.py:996] (2/4) Epoch 12, batch 10350, loss[loss=0.3022, simple_loss=0.3785, pruned_loss=0.113, over 21422.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2891, pruned_loss=0.06318, over 4269000.84 frames. ], batch size: 507, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:17:45,032 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.92 vs. limit=6.0 2023-06-28 14:17:57,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2074866.0, ans=0.0 2023-06-28 14:18:04,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2074866.0, ans=0.2 2023-06-28 14:18:21,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2074926.0, ans=0.1 2023-06-28 14:18:26,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2074926.0, ans=0.125 2023-06-28 14:19:00,730 INFO [train.py:996] (2/4) Epoch 12, batch 10400, loss[loss=0.1707, simple_loss=0.2401, pruned_loss=0.05066, over 21657.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2831, pruned_loss=0.06257, over 4266365.26 frames. ], batch size: 230, lr: 2.44e-03, grad_scale: 32.0 2023-06-28 14:19:20,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2075046.0, ans=0.125 2023-06-28 14:19:58,465 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.939e+02 1.030e+03 1.665e+03 2.817e+03 5.984e+03, threshold=3.330e+03, percent-clipped=36.0 2023-06-28 14:20:04,758 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=6.0 2023-06-28 14:20:17,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2075226.0, ans=0.125 2023-06-28 14:20:43,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2075286.0, ans=0.95 2023-06-28 14:20:46,350 INFO [train.py:996] (2/4) Epoch 12, batch 10450, loss[loss=0.2259, simple_loss=0.3236, pruned_loss=0.06406, over 20769.00 frames. 
], tot_loss[loss=0.2098, simple_loss=0.2882, pruned_loss=0.06573, over 4271184.89 frames. ], batch size: 607, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:20:46,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2075346.0, ans=0.2 2023-06-28 14:20:59,454 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.11 vs. limit=15.0 2023-06-28 14:21:17,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2075406.0, ans=0.1 2023-06-28 14:22:34,292 INFO [train.py:996] (2/4) Epoch 12, batch 10500, loss[loss=0.1839, simple_loss=0.2555, pruned_loss=0.05617, over 21646.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2892, pruned_loss=0.06407, over 4270082.13 frames. ], batch size: 298, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:22:49,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2075706.0, ans=0.09899494936611666 2023-06-28 14:23:30,263 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.346e+02 7.811e+02 1.278e+03 1.903e+03 4.033e+03, threshold=2.556e+03, percent-clipped=2.0 2023-06-28 14:24:16,702 INFO [train.py:996] (2/4) Epoch 12, batch 10550, loss[loss=0.2085, simple_loss=0.2658, pruned_loss=0.07557, over 21256.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2829, pruned_loss=0.06273, over 4246945.09 frames. ], batch size: 471, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:24:35,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2076006.0, ans=0.125 2023-06-28 14:24:52,163 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 14:25:05,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2076066.0, ans=0.95 2023-06-28 14:25:13,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2076066.0, ans=0.0 2023-06-28 14:25:17,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2076126.0, ans=0.0 2023-06-28 14:25:39,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=2076186.0, ans=10.0 2023-06-28 14:25:42,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2076186.0, ans=0.125 2023-06-28 14:26:00,759 INFO [train.py:996] (2/4) Epoch 12, batch 10600, loss[loss=0.1688, simple_loss=0.2741, pruned_loss=0.03174, over 19676.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2784, pruned_loss=0.06138, over 4248875.01 frames. 
], batch size: 702, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:26:21,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2076306.0, ans=0.0 2023-06-28 14:26:29,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2076306.0, ans=0.125 2023-06-28 14:26:59,366 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.873e+02 6.316e+02 8.507e+02 1.506e+03 2.988e+03, threshold=1.701e+03, percent-clipped=6.0 2023-06-28 14:27:27,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2076486.0, ans=0.125 2023-06-28 14:27:46,054 INFO [train.py:996] (2/4) Epoch 12, batch 10650, loss[loss=0.1724, simple_loss=0.2556, pruned_loss=0.04464, over 21704.00 frames. ], tot_loss[loss=0.1999, simple_loss=0.2796, pruned_loss=0.06007, over 4257904.58 frames. ], batch size: 298, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:27:57,533 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.66 vs. limit=22.5 2023-06-28 14:28:33,993 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.04 vs. limit=10.0 2023-06-28 14:29:05,479 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.37 vs. limit=15.0 2023-06-28 14:29:07,077 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=22.5 2023-06-28 14:29:08,633 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=15.0 2023-06-28 14:29:16,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2076786.0, ans=0.1 2023-06-28 14:29:29,915 INFO [train.py:996] (2/4) Epoch 12, batch 10700, loss[loss=0.23, simple_loss=0.3082, pruned_loss=0.07592, over 21361.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2773, pruned_loss=0.0598, over 4254752.17 frames. ], batch size: 549, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:30:15,243 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0 2023-06-28 14:30:32,559 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.676e+02 7.956e+02 1.277e+03 1.864e+03 4.109e+03, threshold=2.555e+03, percent-clipped=30.0 2023-06-28 14:30:40,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2077026.0, ans=0.125 2023-06-28 14:30:52,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2077026.0, ans=0.125 2023-06-28 14:30:58,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2077086.0, ans=0.2 2023-06-28 14:31:21,315 INFO [train.py:996] (2/4) Epoch 12, batch 10750, loss[loss=0.2689, simple_loss=0.3697, pruned_loss=0.08405, over 21641.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2868, pruned_loss=0.06292, over 4259518.81 frames. 
], batch size: 389, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:32:03,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2077266.0, ans=0.125 2023-06-28 14:33:10,861 INFO [train.py:996] (2/4) Epoch 12, batch 10800, loss[loss=0.2409, simple_loss=0.3154, pruned_loss=0.08324, over 21381.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2919, pruned_loss=0.06381, over 4260066.87 frames. ], batch size: 159, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 14:33:50,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2077506.0, ans=0.1 2023-06-28 14:34:08,052 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.630e+02 8.272e+02 1.352e+03 2.286e+03 6.133e+03, threshold=2.703e+03, percent-clipped=22.0 2023-06-28 14:34:51,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2077686.0, ans=0.0 2023-06-28 14:34:54,577 INFO [train.py:996] (2/4) Epoch 12, batch 10850, loss[loss=0.1818, simple_loss=0.2556, pruned_loss=0.05407, over 21802.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2935, pruned_loss=0.06491, over 4260105.39 frames. ], batch size: 317, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:34:55,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2077746.0, ans=0.2 2023-06-28 14:36:36,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2077986.0, ans=0.125 2023-06-28 14:36:38,974 INFO [train.py:996] (2/4) Epoch 12, batch 10900, loss[loss=0.1847, simple_loss=0.2701, pruned_loss=0.04969, over 21375.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2877, pruned_loss=0.0639, over 4260298.26 frames. ], batch size: 194, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:36:42,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2078046.0, ans=0.125 2023-06-28 14:37:00,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2078046.0, ans=0.125 2023-06-28 14:37:27,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=2078166.0, ans=0.5 2023-06-28 14:37:36,118 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.311e+02 7.551e+02 9.581e+02 1.368e+03 2.722e+03, threshold=1.916e+03, percent-clipped=1.0 2023-06-28 14:37:57,322 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.04 vs. limit=15.0 2023-06-28 14:38:20,664 INFO [train.py:996] (2/4) Epoch 12, batch 10950, loss[loss=0.2517, simple_loss=0.3832, pruned_loss=0.06008, over 19688.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2846, pruned_loss=0.06232, over 4259874.59 frames. 
], batch size: 702, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:38:32,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2078346.0, ans=0.125 2023-06-28 14:39:06,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2078466.0, ans=0.07 2023-06-28 14:39:38,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2078526.0, ans=0.125 2023-06-28 14:40:04,311 INFO [train.py:996] (2/4) Epoch 12, batch 11000, loss[loss=0.2045, simple_loss=0.2749, pruned_loss=0.06703, over 21836.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2838, pruned_loss=0.06311, over 4252345.32 frames. ], batch size: 298, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:40:08,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2078646.0, ans=0.035 2023-06-28 14:40:18,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2078646.0, ans=0.0 2023-06-28 14:40:48,673 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.10 vs. limit=10.0 2023-06-28 14:41:02,134 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.714e+02 8.398e+02 1.287e+03 1.832e+03 5.305e+03, threshold=2.574e+03, percent-clipped=21.0 2023-06-28 14:41:45,743 INFO [train.py:996] (2/4) Epoch 12, batch 11050, loss[loss=0.1978, simple_loss=0.2628, pruned_loss=0.06635, over 21845.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.282, pruned_loss=0.06444, over 4261505.49 frames. ], batch size: 98, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:42:01,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2078946.0, ans=0.0 2023-06-28 14:42:43,800 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-28 14:42:56,915 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.91 vs. limit=22.5 2023-06-28 14:43:05,131 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=15.0 2023-06-28 14:43:21,691 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.02 vs. limit=15.0 2023-06-28 14:43:24,003 INFO [train.py:996] (2/4) Epoch 12, batch 11100, loss[loss=0.2061, simple_loss=0.2919, pruned_loss=0.06015, over 21572.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2797, pruned_loss=0.06446, over 4253447.47 frames. 
], batch size: 263, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:43:31,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2079246.0, ans=0.125 2023-06-28 14:43:31,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2079246.0, ans=0.2 2023-06-28 14:43:55,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2079306.0, ans=0.125 2023-06-28 14:44:19,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2079366.0, ans=0.125 2023-06-28 14:44:22,369 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.287e+02 7.124e+02 1.046e+03 1.474e+03 3.228e+03, threshold=2.092e+03, percent-clipped=3.0 2023-06-28 14:45:06,629 INFO [train.py:996] (2/4) Epoch 12, batch 11150, loss[loss=0.1781, simple_loss=0.2462, pruned_loss=0.055, over 21009.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2773, pruned_loss=0.0647, over 4250384.98 frames. ], batch size: 608, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:45:33,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2079606.0, ans=0.2 2023-06-28 14:46:33,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2079786.0, ans=0.0 2023-06-28 14:46:45,485 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.80 vs. limit=15.0 2023-06-28 14:46:49,371 INFO [train.py:996] (2/4) Epoch 12, batch 11200, loss[loss=0.2084, simple_loss=0.268, pruned_loss=0.07437, over 21280.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2763, pruned_loss=0.0639, over 4253675.53 frames. ], batch size: 144, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 14:47:08,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.whiten.whitening_limit, batch_count=2079846.0, ans=12.0 2023-06-28 14:47:48,166 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.550e+02 9.831e+02 1.329e+03 1.720e+03 5.358e+03, threshold=2.658e+03, percent-clipped=16.0 2023-06-28 14:48:11,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2080026.0, ans=0.125 2023-06-28 14:48:30,192 INFO [train.py:996] (2/4) Epoch 12, batch 11250, loss[loss=0.218, simple_loss=0.3072, pruned_loss=0.06443, over 21811.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2747, pruned_loss=0.06397, over 4259110.19 frames. ], batch size: 124, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:48:36,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2080146.0, ans=0.04949747468305833 2023-06-28 14:48:41,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=2080146.0, ans=0.05 2023-06-28 14:48:51,467 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.00 vs. 
limit=15.0 2023-06-28 14:49:23,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2080266.0, ans=0.125 2023-06-28 14:50:12,575 INFO [train.py:996] (2/4) Epoch 12, batch 11300, loss[loss=0.1908, simple_loss=0.2656, pruned_loss=0.058, over 21820.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2761, pruned_loss=0.06388, over 4262995.13 frames. ], batch size: 247, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:50:20,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2080446.0, ans=0.0 2023-06-28 14:50:36,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2080446.0, ans=0.025 2023-06-28 14:50:47,225 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.04 vs. limit=12.0 2023-06-28 14:50:53,400 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 14:51:14,611 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.978e+02 7.526e+02 1.048e+03 1.657e+03 3.488e+03, threshold=2.097e+03, percent-clipped=3.0 2023-06-28 14:51:55,959 INFO [train.py:996] (2/4) Epoch 12, batch 11350, loss[loss=0.2484, simple_loss=0.3312, pruned_loss=0.0828, over 21253.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2768, pruned_loss=0.06345, over 4255023.03 frames. ], batch size: 548, lr: 2.43e-03, grad_scale: 8.0 2023-06-28 14:52:35,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2080806.0, ans=0.125 2023-06-28 14:52:56,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2080926.0, ans=0.1 2023-06-28 14:53:49,215 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2023-06-28 14:53:51,205 INFO [train.py:996] (2/4) Epoch 12, batch 11400, loss[loss=0.2435, simple_loss=0.3277, pruned_loss=0.07959, over 21653.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2838, pruned_loss=0.06563, over 4262108.90 frames. ], batch size: 415, lr: 2.43e-03, grad_scale: 8.0 2023-06-28 14:53:59,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2081046.0, ans=0.125 2023-06-28 14:53:59,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2081046.0, ans=0.125 2023-06-28 14:54:31,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2081166.0, ans=0.0 2023-06-28 14:54:51,456 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-28 14:54:51,841 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.874e+02 7.904e+02 1.165e+03 1.837e+03 4.416e+03, threshold=2.330e+03, percent-clipped=18.0 2023-06-28 14:54:59,397 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 14:55:34,015 INFO [train.py:996] (2/4) Epoch 12, batch 11450, loss[loss=0.1875, simple_loss=0.264, pruned_loss=0.05555, over 20094.00 frames. 
], tot_loss[loss=0.2068, simple_loss=0.2846, pruned_loss=0.06446, over 4251371.29 frames. ], batch size: 702, lr: 2.43e-03, grad_scale: 8.0 2023-06-28 14:55:44,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2081346.0, ans=0.125 2023-06-28 14:55:44,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2081346.0, ans=0.125 2023-06-28 14:55:47,083 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=22.5 2023-06-28 14:56:35,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2081526.0, ans=0.125 2023-06-28 14:57:18,109 INFO [train.py:996] (2/4) Epoch 12, batch 11500, loss[loss=0.2503, simple_loss=0.3226, pruned_loss=0.08901, over 21265.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2881, pruned_loss=0.06557, over 4257754.48 frames. ], batch size: 143, lr: 2.43e-03, grad_scale: 8.0 2023-06-28 14:57:55,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2081766.0, ans=0.125 2023-06-28 14:57:55,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2081766.0, ans=0.2 2023-06-28 14:58:20,248 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.484e+02 9.519e+02 1.302e+03 1.957e+03 4.452e+03, threshold=2.605e+03, percent-clipped=16.0 2023-06-28 14:58:49,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2081886.0, ans=0.125 2023-06-28 14:59:03,077 INFO [train.py:996] (2/4) Epoch 12, batch 11550, loss[loss=0.264, simple_loss=0.3677, pruned_loss=0.08014, over 21279.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2934, pruned_loss=0.06553, over 4260818.16 frames. ], batch size: 548, lr: 2.43e-03, grad_scale: 8.0 2023-06-28 14:59:05,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2081946.0, ans=0.04949747468305833 2023-06-28 14:59:45,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2082066.0, ans=0.035 2023-06-28 15:00:15,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2082126.0, ans=0.2 2023-06-28 15:00:17,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2082126.0, ans=0.125 2023-06-28 15:00:28,947 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.59 vs. limit=15.0 2023-06-28 15:00:29,044 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-06-28 15:00:46,577 INFO [train.py:996] (2/4) Epoch 12, batch 11600, loss[loss=0.2316, simple_loss=0.3254, pruned_loss=0.06891, over 21398.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3078, pruned_loss=0.06654, over 4269361.38 frames. 
], batch size: 131, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:00:47,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2082246.0, ans=0.125 2023-06-28 15:01:16,136 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.36 vs. limit=12.0 2023-06-28 15:01:58,189 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.568e+02 8.664e+02 1.450e+03 2.268e+03 5.007e+03, threshold=2.901e+03, percent-clipped=18.0 2023-06-28 15:02:00,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2082426.0, ans=0.0 2023-06-28 15:02:15,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2082486.0, ans=0.015 2023-06-28 15:02:35,266 INFO [train.py:996] (2/4) Epoch 12, batch 11650, loss[loss=0.1981, simple_loss=0.2842, pruned_loss=0.05594, over 21737.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3128, pruned_loss=0.06681, over 4262708.10 frames. ], batch size: 124, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:02:37,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2082546.0, ans=0.125 2023-06-28 15:03:18,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2082666.0, ans=0.0 2023-06-28 15:03:19,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2082666.0, ans=0.2 2023-06-28 15:03:29,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2082666.0, ans=0.125 2023-06-28 15:03:54,645 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 15:04:16,904 INFO [train.py:996] (2/4) Epoch 12, batch 11700, loss[loss=0.2284, simple_loss=0.2971, pruned_loss=0.07984, over 21300.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.3064, pruned_loss=0.06688, over 4246993.85 frames. ], batch size: 471, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:05:22,739 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.474e+02 9.550e+02 1.552e+03 2.202e+03 4.902e+03, threshold=3.105e+03, percent-clipped=9.0 2023-06-28 15:05:34,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2083026.0, ans=0.125 2023-06-28 15:06:04,491 INFO [train.py:996] (2/4) Epoch 12, batch 11750, loss[loss=0.2331, simple_loss=0.3057, pruned_loss=0.08026, over 21435.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2982, pruned_loss=0.06599, over 4257923.90 frames. 
], batch size: 131, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:06:51,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2083266.0, ans=0.125 2023-06-28 15:06:58,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2083266.0, ans=0.125 2023-06-28 15:06:58,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2083266.0, ans=0.2 2023-06-28 15:07:20,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=2083386.0, ans=0.025 2023-06-28 15:07:47,883 INFO [train.py:996] (2/4) Epoch 12, batch 11800, loss[loss=0.2685, simple_loss=0.3619, pruned_loss=0.08759, over 21371.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2988, pruned_loss=0.06712, over 4261080.57 frames. ], batch size: 507, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:08:48,881 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.444e+02 9.330e+02 1.436e+03 2.225e+03 5.022e+03, threshold=2.872e+03, percent-clipped=11.0 2023-06-28 15:09:08,874 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.32 vs. limit=10.0 2023-06-28 15:09:26,671 INFO [train.py:996] (2/4) Epoch 12, batch 11850, loss[loss=0.2102, simple_loss=0.3078, pruned_loss=0.05627, over 21850.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2981, pruned_loss=0.06621, over 4263950.22 frames. ], batch size: 371, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:09:32,553 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 15:10:10,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2083866.0, ans=0.125 2023-06-28 15:10:15,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2083866.0, ans=0.125 2023-06-28 15:10:27,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2083926.0, ans=0.1 2023-06-28 15:11:10,427 INFO [train.py:996] (2/4) Epoch 12, batch 11900, loss[loss=0.2216, simple_loss=0.3046, pruned_loss=0.06932, over 21616.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2976, pruned_loss=0.06403, over 4265317.71 frames. ], batch size: 414, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:11:14,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2084046.0, ans=0.125 2023-06-28 15:11:38,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2084106.0, ans=0.125 2023-06-28 15:12:11,341 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.53 vs. 
limit=10.0 2023-06-28 15:12:13,559 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.185e+02 7.198e+02 9.065e+02 1.390e+03 3.282e+03, threshold=1.813e+03, percent-clipped=3.0 2023-06-28 15:12:19,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2084226.0, ans=0.0 2023-06-28 15:12:51,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2084286.0, ans=0.5 2023-06-28 15:12:51,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2084286.0, ans=0.125 2023-06-28 15:12:54,394 INFO [train.py:996] (2/4) Epoch 12, batch 11950, loss[loss=0.1846, simple_loss=0.2796, pruned_loss=0.04482, over 21666.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2981, pruned_loss=0.06168, over 4266669.76 frames. ], batch size: 247, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:13:45,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2084466.0, ans=0.2 2023-06-28 15:14:26,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2084586.0, ans=0.2 2023-06-28 15:14:35,776 INFO [train.py:996] (2/4) Epoch 12, batch 12000, loss[loss=0.1919, simple_loss=0.2612, pruned_loss=0.0613, over 21723.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2914, pruned_loss=0.06016, over 4260433.98 frames. ], batch size: 112, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 15:14:35,776 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-28 15:14:56,365 INFO [train.py:1028] (2/4) Epoch 12, validation: loss=0.2655, simple_loss=0.3539, pruned_loss=0.08861, over 1796401.00 frames. 2023-06-28 15:14:56,365 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23793MB 2023-06-28 15:15:02,860 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.75 vs. limit=22.5 2023-06-28 15:15:13,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2084706.0, ans=0.04949747468305833 2023-06-28 15:15:27,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2084706.0, ans=0.125 2023-06-28 15:15:38,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2084766.0, ans=0.1 2023-06-28 15:15:57,094 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.607e+02 7.542e+02 1.127e+03 1.617e+03 2.900e+03, threshold=2.254e+03, percent-clipped=14.0 2023-06-28 15:15:59,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2084826.0, ans=0.1 2023-06-28 15:16:37,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2084946.0, ans=0.07 2023-06-28 15:16:38,994 INFO [train.py:996] (2/4) Epoch 12, batch 12050, loss[loss=0.1928, simple_loss=0.2652, pruned_loss=0.06013, over 21811.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2896, pruned_loss=0.06203, over 4259250.52 frames. 
], batch size: 247, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 15:16:41,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2084946.0, ans=0.1 2023-06-28 15:16:41,324 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 15:18:16,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2085186.0, ans=0.125 2023-06-28 15:18:22,711 INFO [train.py:996] (2/4) Epoch 12, batch 12100, loss[loss=0.2549, simple_loss=0.3396, pruned_loss=0.08505, over 21437.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2947, pruned_loss=0.06665, over 4264538.33 frames. ], batch size: 131, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:19:28,431 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.830e+02 8.334e+02 1.059e+03 1.614e+03 4.516e+03, threshold=2.118e+03, percent-clipped=9.0 2023-06-28 15:19:30,078 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.90 vs. limit=15.0 2023-06-28 15:19:45,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2085426.0, ans=0.125 2023-06-28 15:20:08,888 INFO [train.py:996] (2/4) Epoch 12, batch 12150, loss[loss=0.1205, simple_loss=0.1607, pruned_loss=0.04012, over 17164.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2974, pruned_loss=0.0665, over 4257377.55 frames. ], batch size: 62, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:20:53,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2085666.0, ans=0.125 2023-06-28 15:21:22,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2085726.0, ans=0.125 2023-06-28 15:21:47,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2085786.0, ans=0.125 2023-06-28 15:21:50,478 INFO [train.py:996] (2/4) Epoch 12, batch 12200, loss[loss=0.1802, simple_loss=0.2537, pruned_loss=0.05329, over 21434.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2946, pruned_loss=0.06558, over 4263523.23 frames. ], batch size: 389, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:22:01,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=2085846.0, ans=15.0 2023-06-28 15:22:31,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2085906.0, ans=0.04949747468305833 2023-06-28 15:22:34,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2085966.0, ans=0.125 2023-06-28 15:23:03,725 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.152e+02 7.335e+02 1.257e+03 1.849e+03 4.350e+03, threshold=2.514e+03, percent-clipped=17.0 2023-06-28 15:23:33,610 INFO [train.py:996] (2/4) Epoch 12, batch 12250, loss[loss=0.1522, simple_loss=0.2345, pruned_loss=0.0349, over 21539.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2868, pruned_loss=0.06349, over 4265763.78 frames. 
], batch size: 195, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:23:42,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2086146.0, ans=0.125 2023-06-28 15:24:16,513 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.01 vs. limit=22.5 2023-06-28 15:24:19,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2086266.0, ans=0.0 2023-06-28 15:24:22,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2086266.0, ans=0.0 2023-06-28 15:25:16,908 INFO [train.py:996] (2/4) Epoch 12, batch 12300, loss[loss=0.1718, simple_loss=0.2641, pruned_loss=0.03979, over 21755.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2795, pruned_loss=0.05857, over 4264044.48 frames. ], batch size: 351, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:25:25,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2086446.0, ans=0.125 2023-06-28 15:25:25,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2086446.0, ans=0.2 2023-06-28 15:25:52,285 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.65 vs. limit=12.0 2023-06-28 15:26:11,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2086566.0, ans=0.125 2023-06-28 15:26:26,717 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.10 vs. limit=15.0 2023-06-28 15:26:29,028 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.595e+02 6.953e+02 1.064e+03 1.818e+03 4.648e+03, threshold=2.128e+03, percent-clipped=12.0 2023-06-28 15:26:59,140 INFO [train.py:996] (2/4) Epoch 12, batch 12350, loss[loss=0.2218, simple_loss=0.2981, pruned_loss=0.07269, over 21782.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2829, pruned_loss=0.05817, over 4269432.56 frames. ], batch size: 298, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:27:03,446 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=22.5 2023-06-28 15:28:19,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2086926.0, ans=0.0 2023-06-28 15:28:40,055 INFO [train.py:996] (2/4) Epoch 12, batch 12400, loss[loss=0.26, simple_loss=0.3121, pruned_loss=0.1039, over 21802.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2854, pruned_loss=0.0617, over 4281588.88 frames. ], batch size: 508, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 15:28:44,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2087046.0, ans=0.0 2023-06-28 15:29:07,324 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.20 vs. 
limit=10.0 2023-06-28 15:29:40,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2087166.0, ans=0.07 2023-06-28 15:29:54,407 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.210e+02 8.096e+02 1.137e+03 1.573e+03 3.341e+03, threshold=2.274e+03, percent-clipped=11.0 2023-06-28 15:30:32,678 INFO [train.py:996] (2/4) Epoch 12, batch 12450, loss[loss=0.1974, simple_loss=0.3075, pruned_loss=0.04365, over 19638.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2907, pruned_loss=0.06431, over 4281995.27 frames. ], batch size: 702, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:30:41,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2087346.0, ans=0.125 2023-06-28 15:30:50,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2087406.0, ans=0.125 2023-06-28 15:31:13,475 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 15:31:24,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2087466.0, ans=0.125 2023-06-28 15:31:30,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2087466.0, ans=0.2 2023-06-28 15:32:00,632 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-06-28 15:32:12,105 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=22.5 2023-06-28 15:32:16,007 INFO [train.py:996] (2/4) Epoch 12, batch 12500, loss[loss=0.2391, simple_loss=0.3408, pruned_loss=0.0687, over 21602.00 frames. ], tot_loss[loss=0.218, simple_loss=0.3003, pruned_loss=0.06787, over 4279729.13 frames. ], batch size: 230, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:32:21,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2087646.0, ans=0.0 2023-06-28 15:32:58,732 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.72 vs. limit=22.5 2023-06-28 15:33:17,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2087826.0, ans=0.0 2023-06-28 15:33:22,122 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.609e+02 8.449e+02 1.202e+03 1.905e+03 3.240e+03, threshold=2.404e+03, percent-clipped=12.0 2023-06-28 15:33:59,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2087946.0, ans=0.125 2023-06-28 15:34:05,881 INFO [train.py:996] (2/4) Epoch 12, batch 12550, loss[loss=0.3036, simple_loss=0.3643, pruned_loss=0.1214, over 21338.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3059, pruned_loss=0.07043, over 4279980.15 frames. 
], batch size: 507, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:34:30,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2088006.0, ans=0.125 2023-06-28 15:34:52,176 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=22.5 2023-06-28 15:35:50,640 INFO [train.py:996] (2/4) Epoch 12, batch 12600, loss[loss=0.1833, simple_loss=0.2724, pruned_loss=0.04711, over 21676.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.3055, pruned_loss=0.06834, over 4279614.20 frames. ], batch size: 247, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:36:22,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2088306.0, ans=0.0 2023-06-28 15:36:56,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2088426.0, ans=0.125 2023-06-28 15:36:59,614 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.816e+02 8.006e+02 1.115e+03 1.640e+03 2.498e+03, threshold=2.229e+03, percent-clipped=4.0 2023-06-28 15:37:24,374 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0 2023-06-28 15:37:27,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2088486.0, ans=0.125 2023-06-28 15:37:31,419 INFO [train.py:996] (2/4) Epoch 12, batch 12650, loss[loss=0.1915, simple_loss=0.267, pruned_loss=0.05798, over 21870.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2974, pruned_loss=0.06415, over 4282075.21 frames. ], batch size: 124, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:38:44,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2088726.0, ans=0.0 2023-06-28 15:39:09,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2088786.0, ans=0.0 2023-06-28 15:39:18,845 INFO [train.py:996] (2/4) Epoch 12, batch 12700, loss[loss=0.2997, simple_loss=0.3482, pruned_loss=0.1256, over 21403.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2965, pruned_loss=0.06643, over 4289405.51 frames. ], batch size: 507, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:39:28,547 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.99 vs. limit=12.0 2023-06-28 15:39:35,140 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.18 vs. 
limit=6.0 2023-06-28 15:39:42,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2088906.0, ans=0.125 2023-06-28 15:40:07,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2088966.0, ans=0.04949747468305833 2023-06-28 15:40:22,988 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.191e+02 7.639e+02 1.038e+03 1.743e+03 3.264e+03, threshold=2.075e+03, percent-clipped=12.0 2023-06-28 15:40:31,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2089026.0, ans=0.0 2023-06-28 15:40:40,756 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.03 vs. limit=15.0 2023-06-28 15:40:41,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2089086.0, ans=0.0 2023-06-28 15:41:01,162 INFO [train.py:996] (2/4) Epoch 12, batch 12750, loss[loss=0.2632, simple_loss=0.3294, pruned_loss=0.09849, over 21589.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2969, pruned_loss=0.06634, over 4287117.74 frames. ], batch size: 507, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:41:49,807 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=12.0 2023-06-28 15:42:27,821 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.24 vs. limit=15.0 2023-06-28 15:42:37,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2089386.0, ans=0.1 2023-06-28 15:42:40,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2089386.0, ans=0.125 2023-06-28 15:42:43,113 INFO [train.py:996] (2/4) Epoch 12, batch 12800, loss[loss=0.2025, simple_loss=0.2691, pruned_loss=0.06799, over 21557.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2942, pruned_loss=0.0665, over 4282178.01 frames. ], batch size: 548, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 15:43:23,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2089506.0, ans=0.1 2023-06-28 15:43:55,076 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.524e+02 8.190e+02 1.176e+03 1.690e+03 3.535e+03, threshold=2.352e+03, percent-clipped=8.0 2023-06-28 15:44:09,524 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=22.5 2023-06-28 15:44:27,203 INFO [train.py:996] (2/4) Epoch 12, batch 12850, loss[loss=0.2226, simple_loss=0.2968, pruned_loss=0.07423, over 21373.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2966, pruned_loss=0.06829, over 4286171.09 frames. 
], batch size: 159, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:44:59,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2089806.0, ans=0.125 2023-06-28 15:45:35,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2089926.0, ans=0.0 2023-06-28 15:45:37,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2089926.0, ans=0.2 2023-06-28 15:46:15,556 INFO [train.py:996] (2/4) Epoch 12, batch 12900, loss[loss=0.2299, simple_loss=0.3304, pruned_loss=0.06474, over 21168.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2946, pruned_loss=0.06513, over 4279233.92 frames. ], batch size: 548, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:47:13,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2090166.0, ans=0.0 2023-06-28 15:47:25,954 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.842e+02 7.681e+02 1.209e+03 1.743e+03 3.932e+03, threshold=2.418e+03, percent-clipped=11.0 2023-06-28 15:47:28,902 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.97 vs. limit=15.0 2023-06-28 15:48:02,361 INFO [train.py:996] (2/4) Epoch 12, batch 12950, loss[loss=0.222, simple_loss=0.3047, pruned_loss=0.06962, over 21592.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2927, pruned_loss=0.06308, over 4283367.51 frames. ], batch size: 414, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:48:58,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2090466.0, ans=0.125 2023-06-28 15:49:00,518 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 15:49:03,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=2090526.0, ans=0.5 2023-06-28 15:49:07,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2090526.0, ans=0.0 2023-06-28 15:49:08,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2090526.0, ans=0.125 2023-06-28 15:49:21,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2090586.0, ans=0.0 2023-06-28 15:49:31,285 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=15.0 2023-06-28 15:49:32,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2090586.0, ans=0.125 2023-06-28 15:49:44,548 INFO [train.py:996] (2/4) Epoch 12, batch 13000, loss[loss=0.1303, simple_loss=0.1975, pruned_loss=0.03157, over 21808.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2943, pruned_loss=0.06441, over 4270849.65 frames. 
], batch size: 98, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:49:56,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2090646.0, ans=0.125 2023-06-28 15:50:50,319 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.116e+02 7.680e+02 1.047e+03 1.374e+03 2.853e+03, threshold=2.094e+03, percent-clipped=2.0 2023-06-28 15:51:03,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2090886.0, ans=0.0 2023-06-28 15:51:11,109 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.16 vs. limit=15.0 2023-06-28 15:51:25,518 INFO [train.py:996] (2/4) Epoch 12, batch 13050, loss[loss=0.239, simple_loss=0.3041, pruned_loss=0.08698, over 21668.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2896, pruned_loss=0.06271, over 4270517.25 frames. ], batch size: 471, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:52:07,381 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.08 vs. limit=15.0 2023-06-28 15:53:01,503 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=15.0 2023-06-28 15:53:02,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2091186.0, ans=0.1 2023-06-28 15:53:04,875 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.41 vs. limit=15.0 2023-06-28 15:53:07,027 INFO [train.py:996] (2/4) Epoch 12, batch 13100, loss[loss=0.2367, simple_loss=0.3185, pruned_loss=0.07741, over 21842.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2896, pruned_loss=0.06259, over 4278696.13 frames. ], batch size: 124, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:53:40,224 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=15.0 2023-06-28 15:54:14,292 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.157e+02 6.908e+02 8.185e+02 1.188e+03 2.631e+03, threshold=1.637e+03, percent-clipped=2.0 2023-06-28 15:54:32,391 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.10 vs. limit=22.5 2023-06-28 15:54:50,927 INFO [train.py:996] (2/4) Epoch 12, batch 13150, loss[loss=0.1838, simple_loss=0.263, pruned_loss=0.0523, over 21566.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2923, pruned_loss=0.06526, over 4275683.90 frames. ], batch size: 263, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:54:59,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2091546.0, ans=0.95 2023-06-28 15:56:37,608 INFO [train.py:996] (2/4) Epoch 12, batch 13200, loss[loss=0.252, simple_loss=0.3215, pruned_loss=0.09123, over 21803.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2909, pruned_loss=0.06455, over 4275976.13 frames. 
], batch size: 441, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 15:57:22,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2091966.0, ans=0.125 2023-06-28 15:57:46,671 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.053e+02 7.241e+02 1.163e+03 1.743e+03 3.163e+03, threshold=2.326e+03, percent-clipped=27.0 2023-06-28 15:58:08,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2092086.0, ans=0.2 2023-06-28 15:58:21,572 INFO [train.py:996] (2/4) Epoch 12, batch 13250, loss[loss=0.2153, simple_loss=0.2969, pruned_loss=0.06692, over 21864.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2906, pruned_loss=0.06584, over 4280083.65 frames. ], batch size: 107, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:58:22,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2092146.0, ans=0.5 2023-06-28 15:58:40,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2092206.0, ans=0.125 2023-06-28 15:58:42,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2092206.0, ans=0.2 2023-06-28 15:59:15,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2092266.0, ans=0.0 2023-06-28 15:59:30,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2092326.0, ans=0.125 2023-06-28 16:00:05,148 INFO [train.py:996] (2/4) Epoch 12, batch 13300, loss[loss=0.2089, simple_loss=0.2875, pruned_loss=0.06515, over 21261.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2957, pruned_loss=0.06593, over 4282167.11 frames. ], batch size: 159, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 16:00:09,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2092446.0, ans=0.125 2023-06-28 16:00:40,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2092506.0, ans=10.0 2023-06-28 16:00:42,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2092506.0, ans=0.0 2023-06-28 16:00:52,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2092566.0, ans=0.125 2023-06-28 16:01:17,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2092626.0, ans=0.125 2023-06-28 16:01:23,227 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.410e+02 8.803e+02 1.227e+03 2.112e+03 5.928e+03, threshold=2.454e+03, percent-clipped=20.0 2023-06-28 16:01:48,704 INFO [train.py:996] (2/4) Epoch 12, batch 13350, loss[loss=0.228, simple_loss=0.3132, pruned_loss=0.07143, over 21461.00 frames. ], tot_loss[loss=0.22, simple_loss=0.3024, pruned_loss=0.06879, over 4281300.70 frames. 
], batch size: 211, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 16:02:04,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2092746.0, ans=0.125 2023-06-28 16:02:25,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2092806.0, ans=0.1 2023-06-28 16:02:36,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2092866.0, ans=0.1 2023-06-28 16:02:38,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2092866.0, ans=0.2 2023-06-28 16:03:14,876 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=12.0 2023-06-28 16:03:35,156 INFO [train.py:996] (2/4) Epoch 12, batch 13400, loss[loss=0.2103, simple_loss=0.2846, pruned_loss=0.068, over 21339.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3022, pruned_loss=0.0702, over 4281730.15 frames. ], batch size: 143, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 16:03:54,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2093046.0, ans=0.1 2023-06-28 16:04:11,505 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=22.5 2023-06-28 16:04:44,340 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.284e+02 9.193e+02 1.348e+03 2.044e+03 4.158e+03, threshold=2.695e+03, percent-clipped=16.0 2023-06-28 16:05:08,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2093286.0, ans=0.0 2023-06-28 16:05:14,146 INFO [train.py:996] (2/4) Epoch 12, batch 13450, loss[loss=0.209, simple_loss=0.285, pruned_loss=0.06652, over 21678.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3029, pruned_loss=0.07228, over 4282923.67 frames. ], batch size: 332, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:05:25,189 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-06-28 16:05:29,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2093346.0, ans=0.125 2023-06-28 16:05:41,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2093406.0, ans=0.125 2023-06-28 16:06:58,132 INFO [train.py:996] (2/4) Epoch 12, batch 13500, loss[loss=0.248, simple_loss=0.3226, pruned_loss=0.08674, over 21700.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2928, pruned_loss=0.06894, over 4275636.54 frames. ], batch size: 391, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:06:58,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2093646.0, ans=0.125 2023-06-28 16:07:04,587 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.35 vs. 
limit=8.0 2023-06-28 16:07:54,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2093766.0, ans=0.0 2023-06-28 16:07:55,040 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.54 vs. limit=15.0 2023-06-28 16:08:13,368 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.877e+02 6.964e+02 1.038e+03 1.541e+03 3.052e+03, threshold=2.076e+03, percent-clipped=2.0 2023-06-28 16:08:28,297 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.54 vs. limit=15.0 2023-06-28 16:08:43,468 INFO [train.py:996] (2/4) Epoch 12, batch 13550, loss[loss=0.2593, simple_loss=0.3538, pruned_loss=0.08244, over 21794.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2983, pruned_loss=0.06817, over 4275597.76 frames. ], batch size: 332, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:09:26,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2094066.0, ans=0.125 2023-06-28 16:09:30,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2094066.0, ans=0.125 2023-06-28 16:09:40,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2094066.0, ans=0.125 2023-06-28 16:10:26,348 INFO [train.py:996] (2/4) Epoch 12, batch 13600, loss[loss=0.215, simple_loss=0.2993, pruned_loss=0.06534, over 21815.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.3001, pruned_loss=0.06854, over 4278811.01 frames. ], batch size: 414, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 16:10:28,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2094246.0, ans=0.0 2023-06-28 16:10:53,530 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=22.5 2023-06-28 16:11:06,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2094306.0, ans=0.125 2023-06-28 16:11:14,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2094366.0, ans=0.125 2023-06-28 16:11:39,038 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.839e+02 7.727e+02 1.210e+03 1.733e+03 4.112e+03, threshold=2.419e+03, percent-clipped=15.0 2023-06-28 16:11:47,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2094426.0, ans=0.125 2023-06-28 16:12:04,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2094486.0, ans=0.0 2023-06-28 16:12:13,571 INFO [train.py:996] (2/4) Epoch 12, batch 13650, loss[loss=0.2025, simple_loss=0.2736, pruned_loss=0.06568, over 21844.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2956, pruned_loss=0.06598, over 4282115.59 frames. 
], batch size: 98, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 16:12:36,216 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 16:13:04,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2094666.0, ans=0.125 2023-06-28 16:13:21,473 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.31 vs. limit=15.0 2023-06-28 16:13:22,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2094726.0, ans=0.1 2023-06-28 16:13:41,344 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.83 vs. limit=15.0 2023-06-28 16:13:57,091 INFO [train.py:996] (2/4) Epoch 12, batch 13700, loss[loss=0.1696, simple_loss=0.2334, pruned_loss=0.05291, over 15425.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2891, pruned_loss=0.06591, over 4273968.72 frames. ], batch size: 60, lr: 2.42e-03, grad_scale: 8.0 2023-06-28 16:14:23,961 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.48 vs. limit=15.0 2023-06-28 16:14:25,045 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 16:14:30,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2094906.0, ans=0.1 2023-06-28 16:15:02,965 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=15.0 2023-06-28 16:15:04,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2095026.0, ans=0.1 2023-06-28 16:15:15,684 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.251e+02 7.877e+02 1.121e+03 1.931e+03 5.975e+03, threshold=2.242e+03, percent-clipped=12.0 2023-06-28 16:15:47,604 INFO [train.py:996] (2/4) Epoch 12, batch 13750, loss[loss=0.1454, simple_loss=0.21, pruned_loss=0.04035, over 21733.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2876, pruned_loss=0.06599, over 4269795.15 frames. 
], batch size: 124, lr: 2.42e-03, grad_scale: 8.0 2023-06-28 16:15:48,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2095146.0, ans=0.125 2023-06-28 16:15:49,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2095146.0, ans=0.09899494936611666 2023-06-28 16:15:52,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2095146.0, ans=0.2 2023-06-28 16:15:53,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2095146.0, ans=0.0 2023-06-28 16:15:55,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2095146.0, ans=0.1 2023-06-28 16:16:55,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2095326.0, ans=0.5 2023-06-28 16:16:57,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2095326.0, ans=0.025 2023-06-28 16:17:15,319 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.89 vs. limit=22.5 2023-06-28 16:17:24,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2095386.0, ans=0.0 2023-06-28 16:17:34,228 INFO [train.py:996] (2/4) Epoch 12, batch 13800, loss[loss=0.2229, simple_loss=0.3462, pruned_loss=0.04977, over 19746.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2942, pruned_loss=0.06449, over 4262288.21 frames. ], batch size: 702, lr: 2.42e-03, grad_scale: 8.0 2023-06-28 16:17:45,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2095446.0, ans=0.0 2023-06-28 16:18:07,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2095506.0, ans=0.125 2023-06-28 16:18:53,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2095626.0, ans=0.125 2023-06-28 16:18:56,445 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.457e+02 7.505e+02 1.008e+03 1.759e+03 5.617e+03, threshold=2.016e+03, percent-clipped=13.0 2023-06-28 16:19:18,271 INFO [train.py:996] (2/4) Epoch 12, batch 13850, loss[loss=0.2289, simple_loss=0.3092, pruned_loss=0.0743, over 21803.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.3007, pruned_loss=0.06524, over 4256747.38 frames. ], batch size: 282, lr: 2.42e-03, grad_scale: 8.0 2023-06-28 16:20:03,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2095806.0, ans=0.0 2023-06-28 16:21:05,044 INFO [train.py:996] (2/4) Epoch 12, batch 13900, loss[loss=0.2308, simple_loss=0.2998, pruned_loss=0.0809, over 21676.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.3024, pruned_loss=0.06703, over 4261417.66 frames. 
], batch size: 389, lr: 2.42e-03, grad_scale: 8.0 2023-06-28 16:22:04,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2096166.0, ans=0.2 2023-06-28 16:22:20,627 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.853e+02 9.390e+02 1.248e+03 1.935e+03 5.140e+03, threshold=2.497e+03, percent-clipped=23.0 2023-06-28 16:22:22,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2096226.0, ans=0.125 2023-06-28 16:22:47,377 INFO [train.py:996] (2/4) Epoch 12, batch 13950, loss[loss=0.2223, simple_loss=0.298, pruned_loss=0.07326, over 21635.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.302, pruned_loss=0.06929, over 4268039.19 frames. ], batch size: 230, lr: 2.42e-03, grad_scale: 8.0 2023-06-28 16:23:49,658 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.75 vs. limit=15.0 2023-06-28 16:24:25,030 INFO [train.py:996] (2/4) Epoch 12, batch 14000, loss[loss=0.1866, simple_loss=0.2773, pruned_loss=0.04789, over 21644.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2993, pruned_loss=0.06785, over 4262581.70 frames. ], batch size: 263, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:24:25,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2096646.0, ans=0.1 2023-06-28 16:24:45,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2096646.0, ans=0.1 2023-06-28 16:25:45,214 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.785e+02 7.339e+02 1.044e+03 1.507e+03 3.234e+03, threshold=2.088e+03, percent-clipped=5.0 2023-06-28 16:25:55,552 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 16:25:57,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2096886.0, ans=0.125 2023-06-28 16:26:11,281 INFO [train.py:996] (2/4) Epoch 12, batch 14050, loss[loss=0.1729, simple_loss=0.2525, pruned_loss=0.04668, over 21809.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2932, pruned_loss=0.06377, over 4266223.53 frames. ], batch size: 118, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:26:11,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2096946.0, ans=0.1 2023-06-28 16:26:55,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2097066.0, ans=0.125 2023-06-28 16:27:10,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2097066.0, ans=0.125 2023-06-28 16:27:26,668 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 16:27:52,890 INFO [train.py:996] (2/4) Epoch 12, batch 14100, loss[loss=0.2106, simple_loss=0.2804, pruned_loss=0.07037, over 21644.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.285, pruned_loss=0.0629, over 4262584.71 frames. 
], batch size: 298, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:28:36,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2097366.0, ans=0.1 2023-06-28 16:28:44,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2097366.0, ans=0.2 2023-06-28 16:29:08,549 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.981e+02 8.924e+02 1.261e+03 1.864e+03 4.328e+03, threshold=2.523e+03, percent-clipped=18.0 2023-06-28 16:29:29,314 INFO [train.py:996] (2/4) Epoch 12, batch 14150, loss[loss=0.1949, simple_loss=0.2883, pruned_loss=0.05079, over 21419.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2891, pruned_loss=0.06399, over 4273824.66 frames. ], batch size: 194, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:29:31,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2097546.0, ans=0.0 2023-06-28 16:30:31,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2097726.0, ans=0.2 2023-06-28 16:30:50,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2097786.0, ans=0.1 2023-06-28 16:30:51,310 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.95 vs. limit=22.5 2023-06-28 16:31:02,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2097786.0, ans=0.125 2023-06-28 16:31:02,584 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=15.0 2023-06-28 16:31:07,778 INFO [train.py:996] (2/4) Epoch 12, batch 14200, loss[loss=0.2119, simple_loss=0.2883, pruned_loss=0.06774, over 21331.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2898, pruned_loss=0.06367, over 4265243.11 frames. ], batch size: 159, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:31:08,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2097846.0, ans=0.0 2023-06-28 16:32:10,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2098026.0, ans=0.125 2023-06-28 16:32:16,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2098026.0, ans=0.1 2023-06-28 16:32:21,329 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.921e+02 6.799e+02 8.924e+02 1.241e+03 3.377e+03, threshold=1.785e+03, percent-clipped=4.0 2023-06-28 16:32:36,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2098086.0, ans=0.1 2023-06-28 16:32:38,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2098086.0, ans=0.0 2023-06-28 16:32:44,127 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.37 vs. limit=12.0 2023-06-28 16:32:47,807 INFO [train.py:996] (2/4) Epoch 12, batch 14250, loss[loss=0.1687, simple_loss=0.2462, pruned_loss=0.0456, over 21569.00 frames. 
], tot_loss[loss=0.2051, simple_loss=0.2832, pruned_loss=0.06348, over 4261074.64 frames. ], batch size: 263, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:33:24,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2098206.0, ans=0.125 2023-06-28 16:33:53,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2098326.0, ans=0.125 2023-06-28 16:34:16,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2098386.0, ans=0.0 2023-06-28 16:34:19,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2098386.0, ans=0.125 2023-06-28 16:34:26,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2098386.0, ans=0.125 2023-06-28 16:34:34,329 INFO [train.py:996] (2/4) Epoch 12, batch 14300, loss[loss=0.2381, simple_loss=0.3341, pruned_loss=0.07102, over 21617.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2851, pruned_loss=0.06219, over 4260570.07 frames. ], batch size: 263, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:34:43,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2098446.0, ans=0.125 2023-06-28 16:35:26,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2098566.0, ans=0.0 2023-06-28 16:35:55,449 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.012e+02 8.237e+02 1.255e+03 2.124e+03 4.385e+03, threshold=2.511e+03, percent-clipped=34.0 2023-06-28 16:36:17,110 INFO [train.py:996] (2/4) Epoch 12, batch 14350, loss[loss=0.1858, simple_loss=0.2665, pruned_loss=0.05251, over 21458.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2918, pruned_loss=0.06306, over 4241605.67 frames. ], batch size: 194, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:36:21,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2098746.0, ans=0.0 2023-06-28 16:36:41,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2098806.0, ans=0.0 2023-06-28 16:36:47,632 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=15.0 2023-06-28 16:37:00,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2098806.0, ans=0.1 2023-06-28 16:37:03,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2098866.0, ans=0.125 2023-06-28 16:37:38,858 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.22 vs. 
limit=22.5 2023-06-28 16:37:54,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2098986.0, ans=0.125 2023-06-28 16:37:56,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=2098986.0, ans=0.025 2023-06-28 16:37:59,253 INFO [train.py:996] (2/4) Epoch 12, batch 14400, loss[loss=0.1871, simple_loss=0.268, pruned_loss=0.05308, over 21820.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2907, pruned_loss=0.0644, over 4253464.14 frames. ], batch size: 316, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 16:38:01,921 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.55 vs. limit=15.0 2023-06-28 16:38:12,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2099046.0, ans=0.125 2023-06-28 16:38:47,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2099166.0, ans=0.1 2023-06-28 16:38:55,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2099166.0, ans=0.125 2023-06-28 16:39:00,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2099226.0, ans=0.95 2023-06-28 16:39:16,023 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-06-28 16:39:18,232 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.847e+02 6.927e+02 1.038e+03 1.645e+03 3.908e+03, threshold=2.076e+03, percent-clipped=8.0 2023-06-28 16:39:39,687 INFO [train.py:996] (2/4) Epoch 12, batch 14450, loss[loss=0.1797, simple_loss=0.2447, pruned_loss=0.0573, over 21528.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.285, pruned_loss=0.0651, over 4263350.08 frames. ], batch size: 212, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 16:40:06,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2099406.0, ans=0.2 2023-06-28 16:40:42,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2099466.0, ans=0.1 2023-06-28 16:40:44,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2099526.0, ans=0.0 2023-06-28 16:41:23,322 INFO [train.py:996] (2/4) Epoch 12, batch 14500, loss[loss=0.196, simple_loss=0.2756, pruned_loss=0.05823, over 21773.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2807, pruned_loss=0.06471, over 4272998.34 frames. ], batch size: 98, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 16:41:24,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2099646.0, ans=0.1 2023-06-28 16:41:40,758 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.54 vs. limit=15.0 2023-06-28 16:42:05,709 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.48 vs. 
limit=10.0 2023-06-28 16:42:46,536 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.162e+02 7.722e+02 1.013e+03 1.611e+03 2.945e+03, threshold=2.026e+03, percent-clipped=11.0 2023-06-28 16:43:11,636 INFO [train.py:996] (2/4) Epoch 12, batch 14550, loss[loss=0.2796, simple_loss=0.3477, pruned_loss=0.1057, over 21434.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2844, pruned_loss=0.06606, over 4274957.44 frames. ], batch size: 471, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:43:38,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2100006.0, ans=0.0 2023-06-28 16:44:20,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2100126.0, ans=0.0 2023-06-28 16:44:31,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2100126.0, ans=0.125 2023-06-28 16:44:59,763 INFO [train.py:996] (2/4) Epoch 12, batch 14600, loss[loss=0.2415, simple_loss=0.3272, pruned_loss=0.07786, over 21744.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2928, pruned_loss=0.06997, over 4279122.05 frames. ], batch size: 441, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:45:31,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2100306.0, ans=0.0 2023-06-28 16:46:12,062 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.268e+02 8.624e+02 1.300e+03 2.155e+03 4.412e+03, threshold=2.599e+03, percent-clipped=26.0 2023-06-28 16:46:14,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2100486.0, ans=0.0 2023-06-28 16:46:41,595 INFO [train.py:996] (2/4) Epoch 12, batch 14650, loss[loss=0.2094, simple_loss=0.2849, pruned_loss=0.067, over 21264.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2965, pruned_loss=0.06975, over 4261391.17 frames. ], batch size: 159, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:47:50,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2100726.0, ans=0.0 2023-06-28 16:48:10,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2100786.0, ans=0.125 2023-06-28 16:48:24,619 INFO [train.py:996] (2/4) Epoch 12, batch 14700, loss[loss=0.1864, simple_loss=0.2721, pruned_loss=0.05031, over 21274.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2921, pruned_loss=0.06514, over 4247531.68 frames. ], batch size: 176, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:48:50,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2100906.0, ans=0.2 2023-06-28 16:49:12,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2100966.0, ans=0.0 2023-06-28 16:49:40,144 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.485e+02 7.512e+02 1.036e+03 1.553e+03 3.154e+03, threshold=2.072e+03, percent-clipped=4.0 2023-06-28 16:49:49,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2101086.0, ans=0.07 2023-06-28 16:50:15,502 INFO [train.py:996] (2/4) Epoch 12, batch 14750, loss[loss=0.3158, simple_loss=0.3798, pruned_loss=0.1259, over 21482.00 frames. 
], tot_loss[loss=0.2156, simple_loss=0.297, pruned_loss=0.0671, over 4250689.66 frames. ], batch size: 471, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:50:19,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2101146.0, ans=0.0 2023-06-28 16:50:40,748 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-28 16:50:57,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2101266.0, ans=0.1 2023-06-28 16:51:44,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2101386.0, ans=0.2 2023-06-28 16:51:51,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2101386.0, ans=0.125 2023-06-28 16:51:54,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2101386.0, ans=0.0 2023-06-28 16:51:58,895 INFO [train.py:996] (2/4) Epoch 12, batch 14800, loss[loss=0.2161, simple_loss=0.2818, pruned_loss=0.0752, over 21117.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3095, pruned_loss=0.07247, over 4256808.52 frames. ], batch size: 143, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 16:52:06,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2101446.0, ans=0.1 2023-06-28 16:52:06,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2101446.0, ans=0.0 2023-06-28 16:52:19,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2101506.0, ans=0.125 2023-06-28 16:52:29,471 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.67 vs. limit=5.0 2023-06-28 16:53:15,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2101626.0, ans=0.2 2023-06-28 16:53:24,531 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.184e+02 7.719e+02 1.255e+03 2.135e+03 5.182e+03, threshold=2.510e+03, percent-clipped=29.0 2023-06-28 16:53:39,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2101686.0, ans=0.0 2023-06-28 16:53:50,412 INFO [train.py:996] (2/4) Epoch 12, batch 14850, loss[loss=0.1784, simple_loss=0.2486, pruned_loss=0.05409, over 21102.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3027, pruned_loss=0.07193, over 4250542.24 frames. ], batch size: 143, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:54:54,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2101926.0, ans=0.125 2023-06-28 16:55:33,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2102046.0, ans=0.0 2023-06-28 16:55:34,853 INFO [train.py:996] (2/4) Epoch 12, batch 14900, loss[loss=0.217, simple_loss=0.2938, pruned_loss=0.07006, over 21523.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3036, pruned_loss=0.07247, over 4256162.77 frames. 
], batch size: 194, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:55:37,025 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 16:56:34,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2102226.0, ans=0.125 2023-06-28 16:56:37,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2102226.0, ans=0.125 2023-06-28 16:56:47,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2102226.0, ans=0.0 2023-06-28 16:56:55,265 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.996e+02 9.079e+02 1.317e+03 1.882e+03 4.138e+03, threshold=2.634e+03, percent-clipped=10.0 2023-06-28 16:57:14,115 INFO [train.py:996] (2/4) Epoch 12, batch 14950, loss[loss=0.1821, simple_loss=0.2787, pruned_loss=0.04275, over 21891.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3042, pruned_loss=0.0721, over 4262473.07 frames. ], batch size: 317, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:57:18,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2102346.0, ans=0.2 2023-06-28 16:57:40,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2102406.0, ans=0.125 2023-06-28 16:57:48,723 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.00 vs. limit=15.0 2023-06-28 16:57:49,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2102406.0, ans=0.125 2023-06-28 16:58:40,743 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.46 vs. limit=15.0 2023-06-28 16:58:52,429 INFO [train.py:996] (2/4) Epoch 12, batch 15000, loss[loss=0.197, simple_loss=0.2731, pruned_loss=0.06043, over 21674.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3046, pruned_loss=0.07233, over 4267290.86 frames. ], batch size: 263, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:58:52,429 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-28 16:59:11,976 INFO [train.py:1028] (2/4) Epoch 12, validation: loss=0.2573, simple_loss=0.3458, pruned_loss=0.08437, over 1796401.00 frames. 2023-06-28 16:59:11,977 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23793MB 2023-06-28 16:59:47,265 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-28 16:59:53,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2102766.0, ans=0.2 2023-06-28 17:00:24,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2102826.0, ans=0.125 2023-06-28 17:00:28,472 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.412e+02 7.471e+02 1.040e+03 1.542e+03 3.461e+03, threshold=2.079e+03, percent-clipped=2.0 2023-06-28 17:00:57,516 INFO [train.py:996] (2/4) Epoch 12, batch 15050, loss[loss=0.2036, simple_loss=0.2926, pruned_loss=0.05727, over 21594.00 frames. 
], tot_loss[loss=0.2244, simple_loss=0.3038, pruned_loss=0.07248, over 4266171.52 frames. ], batch size: 230, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:01:16,987 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-06-28 17:01:18,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2103006.0, ans=0.2 2023-06-28 17:01:23,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2103006.0, ans=0.0 2023-06-28 17:01:29,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2103006.0, ans=0.0 2023-06-28 17:01:33,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2103006.0, ans=0.125 2023-06-28 17:01:58,320 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-28 17:02:45,492 INFO [train.py:996] (2/4) Epoch 12, batch 15100, loss[loss=0.2397, simple_loss=0.3278, pruned_loss=0.0758, over 19897.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3061, pruned_loss=0.07211, over 4265566.86 frames. ], batch size: 703, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:04:01,204 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.61 vs. limit=15.0 2023-06-28 17:04:04,842 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.134e+02 7.864e+02 1.140e+03 1.681e+03 3.504e+03, threshold=2.280e+03, percent-clipped=13.0 2023-06-28 17:04:21,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2103486.0, ans=0.0 2023-06-28 17:04:27,711 INFO [train.py:996] (2/4) Epoch 12, batch 15150, loss[loss=0.1995, simple_loss=0.2673, pruned_loss=0.0658, over 21814.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3021, pruned_loss=0.07263, over 4257380.86 frames. ], batch size: 372, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:04:28,880 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=22.5 2023-06-28 17:06:10,670 INFO [train.py:996] (2/4) Epoch 12, batch 15200, loss[loss=0.1983, simple_loss=0.267, pruned_loss=0.06481, over 21145.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2929, pruned_loss=0.06891, over 4262984.94 frames. ], batch size: 143, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 17:06:11,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2103846.0, ans=0.2 2023-06-28 17:06:13,455 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.88 vs. limit=12.0 2023-06-28 17:06:15,125 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.02 vs. 
limit=10.0 2023-06-28 17:07:18,519 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 17:07:34,249 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.768e+02 7.003e+02 9.713e+02 1.349e+03 2.577e+03, threshold=1.943e+03, percent-clipped=4.0 2023-06-28 17:07:52,328 INFO [train.py:996] (2/4) Epoch 12, batch 15250, loss[loss=0.2325, simple_loss=0.2997, pruned_loss=0.08268, over 21187.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2881, pruned_loss=0.0676, over 4255386.45 frames. ], batch size: 143, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:07:55,375 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.47 vs. limit=12.0 2023-06-28 17:08:44,397 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-06-28 17:08:53,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2104266.0, ans=0.035 2023-06-28 17:09:34,149 INFO [train.py:996] (2/4) Epoch 12, batch 15300, loss[loss=0.2324, simple_loss=0.3095, pruned_loss=0.07769, over 21656.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2899, pruned_loss=0.06945, over 4261768.45 frames. ], batch size: 351, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:09:44,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2104446.0, ans=0.125 2023-06-28 17:09:58,659 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-28 17:10:04,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2104506.0, ans=0.0 2023-06-28 17:10:06,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2104506.0, ans=0.0 2023-06-28 17:10:33,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2104566.0, ans=0.125 2023-06-28 17:11:01,320 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.828e+02 9.652e+02 1.202e+03 1.838e+03 3.602e+03, threshold=2.404e+03, percent-clipped=24.0 2023-06-28 17:11:17,463 INFO [train.py:996] (2/4) Epoch 12, batch 15350, loss[loss=0.207, simple_loss=0.3081, pruned_loss=0.05296, over 21449.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2953, pruned_loss=0.07139, over 4263821.13 frames. ], batch size: 194, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:11:44,984 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.11 vs. limit=10.0 2023-06-28 17:11:53,040 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.58 vs. 
limit=10.0 2023-06-28 17:12:03,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2104866.0, ans=0.0 2023-06-28 17:12:09,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2104866.0, ans=0.125 2023-06-28 17:12:56,973 INFO [train.py:996] (2/4) Epoch 12, batch 15400, loss[loss=0.1893, simple_loss=0.2744, pruned_loss=0.05207, over 21918.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2974, pruned_loss=0.07018, over 4271626.68 frames. ], batch size: 316, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:13:11,769 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.72 vs. limit=15.0 2023-06-28 17:13:22,992 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.97 vs. limit=15.0 2023-06-28 17:14:16,125 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.384e+02 7.580e+02 1.010e+03 1.519e+03 4.001e+03, threshold=2.021e+03, percent-clipped=6.0 2023-06-28 17:14:21,801 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 17:14:21,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2105286.0, ans=0.0 2023-06-28 17:14:23,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2105286.0, ans=0.0 2023-06-28 17:14:23,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2105286.0, ans=0.0 2023-06-28 17:14:37,996 INFO [train.py:996] (2/4) Epoch 12, batch 15450, loss[loss=0.2039, simple_loss=0.2893, pruned_loss=0.05924, over 21820.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2951, pruned_loss=0.06958, over 4268488.52 frames. ], batch size: 332, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:15:45,371 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.18 vs. limit=15.0 2023-06-28 17:16:04,987 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=15.0 2023-06-28 17:16:19,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2105646.0, ans=0.125 2023-06-28 17:16:20,946 INFO [train.py:996] (2/4) Epoch 12, batch 15500, loss[loss=0.26, simple_loss=0.3369, pruned_loss=0.09162, over 21803.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2976, pruned_loss=0.06916, over 4263998.25 frames. 
], batch size: 441, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:16:41,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2105706.0, ans=0.1 2023-06-28 17:17:07,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2105766.0, ans=0.125 2023-06-28 17:17:43,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2105826.0, ans=0.0 2023-06-28 17:17:46,455 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.645e+02 8.205e+02 1.251e+03 1.746e+03 3.424e+03, threshold=2.502e+03, percent-clipped=13.0 2023-06-28 17:18:07,378 INFO [train.py:996] (2/4) Epoch 12, batch 15550, loss[loss=0.2241, simple_loss=0.3325, pruned_loss=0.05783, over 19772.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2962, pruned_loss=0.0674, over 4270756.22 frames. ], batch size: 703, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:18:22,073 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=15.0 2023-06-28 17:18:36,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2106006.0, ans=0.0 2023-06-28 17:18:50,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2106066.0, ans=0.0 2023-06-28 17:19:20,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2106126.0, ans=0.125 2023-06-28 17:19:50,311 INFO [train.py:996] (2/4) Epoch 12, batch 15600, loss[loss=0.1905, simple_loss=0.2566, pruned_loss=0.06223, over 21611.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2903, pruned_loss=0.06577, over 4270622.52 frames. ], batch size: 298, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 17:20:13,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2106306.0, ans=0.0 2023-06-28 17:20:22,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2106306.0, ans=0.0 2023-06-28 17:20:48,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2106366.0, ans=10.0 2023-06-28 17:21:05,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2106426.0, ans=0.125 2023-06-28 17:21:06,463 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.72 vs. limit=6.0 2023-06-28 17:21:08,513 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.963e+02 9.239e+02 1.318e+03 1.838e+03 4.350e+03, threshold=2.636e+03, percent-clipped=8.0 2023-06-28 17:21:29,750 INFO [train.py:996] (2/4) Epoch 12, batch 15650, loss[loss=0.1868, simple_loss=0.2629, pruned_loss=0.05536, over 21766.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.289, pruned_loss=0.06507, over 4264150.60 frames. 
], batch size: 124, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:22:04,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2106606.0, ans=0.125 2023-06-28 17:22:13,821 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.00 vs. limit=5.0 2023-06-28 17:23:06,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2106786.0, ans=0.125 2023-06-28 17:23:12,649 INFO [train.py:996] (2/4) Epoch 12, batch 15700, loss[loss=0.1999, simple_loss=0.2847, pruned_loss=0.05754, over 21764.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2867, pruned_loss=0.0644, over 4262721.59 frames. ], batch size: 351, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:24:39,911 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.792e+02 8.496e+02 1.514e+03 2.181e+03 4.345e+03, threshold=3.028e+03, percent-clipped=16.0 2023-06-28 17:24:53,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2107146.0, ans=0.125 2023-06-28 17:24:54,652 INFO [train.py:996] (2/4) Epoch 12, batch 15750, loss[loss=0.1836, simple_loss=0.2604, pruned_loss=0.05337, over 21637.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2824, pruned_loss=0.06368, over 4269051.70 frames. ], batch size: 298, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:24:55,842 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=12.0 2023-06-28 17:25:33,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2107206.0, ans=0.0 2023-06-28 17:25:57,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2107326.0, ans=0.2 2023-06-28 17:26:35,208 INFO [train.py:996] (2/4) Epoch 12, batch 15800, loss[loss=0.2148, simple_loss=0.2809, pruned_loss=0.07439, over 19989.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2771, pruned_loss=0.06333, over 4267087.93 frames. ], batch size: 702, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:26:59,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2107446.0, ans=0.0 2023-06-28 17:27:34,909 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=15.0 2023-06-28 17:27:46,740 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-06-28 17:27:56,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2107686.0, ans=0.125 2023-06-28 17:28:01,460 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.867e+02 7.167e+02 8.955e+02 1.687e+03 3.256e+03, threshold=1.791e+03, percent-clipped=1.0 2023-06-28 17:28:16,333 INFO [train.py:996] (2/4) Epoch 12, batch 15850, loss[loss=0.2145, simple_loss=0.2839, pruned_loss=0.07259, over 21226.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2788, pruned_loss=0.06534, over 4261771.83 frames. 
], batch size: 176, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:28:41,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2107806.0, ans=0.2 2023-06-28 17:28:43,673 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.65 vs. limit=10.0 2023-06-28 17:28:46,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2107806.0, ans=0.2 2023-06-28 17:29:53,388 INFO [train.py:996] (2/4) Epoch 12, batch 15900, loss[loss=0.2414, simple_loss=0.3055, pruned_loss=0.08863, over 21323.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2761, pruned_loss=0.06574, over 4264996.38 frames. ], batch size: 507, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:30:56,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2108226.0, ans=0.0 2023-06-28 17:31:15,601 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.841e+02 7.381e+02 9.815e+02 1.486e+03 2.540e+03, threshold=1.963e+03, percent-clipped=11.0 2023-06-28 17:31:22,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2108286.0, ans=0.125 2023-06-28 17:31:34,458 INFO [train.py:996] (2/4) Epoch 12, batch 15950, loss[loss=0.1704, simple_loss=0.268, pruned_loss=0.03638, over 21860.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2782, pruned_loss=0.06375, over 4250479.80 frames. ], batch size: 316, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:31:43,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2108346.0, ans=0.035 2023-06-28 17:31:44,036 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.01 vs. limit=15.0 2023-06-28 17:31:48,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2108346.0, ans=0.2 2023-06-28 17:32:07,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2108406.0, ans=0.0 2023-06-28 17:32:21,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2108466.0, ans=0.2 2023-06-28 17:33:11,198 INFO [train.py:996] (2/4) Epoch 12, batch 16000, loss[loss=0.1833, simple_loss=0.2772, pruned_loss=0.04471, over 21782.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2799, pruned_loss=0.06139, over 4251003.65 frames. ], batch size: 247, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:33:29,440 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 17:33:37,360 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-06-28 17:33:43,212 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.45 vs. 
limit=15.0 2023-06-28 17:34:12,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2108766.0, ans=0.1 2023-06-28 17:34:20,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2108826.0, ans=0.2 2023-06-28 17:34:39,582 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.641e+02 6.614e+02 9.934e+02 1.443e+03 3.349e+03, threshold=1.987e+03, percent-clipped=8.0 2023-06-28 17:34:41,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2108886.0, ans=0.125 2023-06-28 17:34:52,876 INFO [train.py:996] (2/4) Epoch 12, batch 16050, loss[loss=0.2426, simple_loss=0.3455, pruned_loss=0.0698, over 21639.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2825, pruned_loss=0.05991, over 4262554.03 frames. ], batch size: 414, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:35:10,405 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.99 vs. limit=5.0 2023-06-28 17:35:35,510 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 17:35:38,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2109066.0, ans=0.09899494936611666 2023-06-28 17:35:38,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2109066.0, ans=0.1 2023-06-28 17:35:50,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2109126.0, ans=0.1 2023-06-28 17:36:07,826 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.89 vs. limit=15.0 2023-06-28 17:36:28,144 INFO [train.py:996] (2/4) Epoch 12, batch 16100, loss[loss=0.19, simple_loss=0.252, pruned_loss=0.06402, over 21236.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2861, pruned_loss=0.06144, over 4263570.55 frames. ], batch size: 608, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:37:11,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2109366.0, ans=0.125 2023-06-28 17:37:20,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2109366.0, ans=0.09899494936611666 2023-06-28 17:37:52,810 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.516e+02 1.028e+03 1.550e+03 2.496e+03 6.023e+03, threshold=3.100e+03, percent-clipped=39.0 2023-06-28 17:38:06,339 INFO [train.py:996] (2/4) Epoch 12, batch 16150, loss[loss=0.2358, simple_loss=0.3275, pruned_loss=0.07202, over 17617.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2851, pruned_loss=0.06328, over 4274546.15 frames. ], batch size: 60, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:39:49,881 INFO [train.py:996] (2/4) Epoch 12, batch 16200, loss[loss=0.2713, simple_loss=0.3464, pruned_loss=0.09811, over 21454.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2924, pruned_loss=0.06541, over 4277681.63 frames. 
], batch size: 131, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:39:51,401 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.94 vs. limit=15.0 2023-06-28 17:40:15,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=2109906.0, ans=0.025 2023-06-28 17:40:56,640 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.04 vs. limit=15.0 2023-06-28 17:41:01,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2110026.0, ans=0.1 2023-06-28 17:41:21,145 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.167e+02 9.228e+02 1.472e+03 2.186e+03 5.217e+03, threshold=2.943e+03, percent-clipped=8.0 2023-06-28 17:41:27,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2110086.0, ans=0.0 2023-06-28 17:41:39,809 INFO [train.py:996] (2/4) Epoch 12, batch 16250, loss[loss=0.1645, simple_loss=0.2492, pruned_loss=0.03995, over 21381.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2941, pruned_loss=0.06599, over 4272043.26 frames. ], batch size: 211, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:41:56,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2110146.0, ans=0.125 2023-06-28 17:42:00,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2110206.0, ans=0.2 2023-06-28 17:43:06,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2110386.0, ans=0.125 2023-06-28 17:43:22,332 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.67 vs. limit=22.5 2023-06-28 17:43:22,709 INFO [train.py:996] (2/4) Epoch 12, batch 16300, loss[loss=0.1787, simple_loss=0.2607, pruned_loss=0.04839, over 21705.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2866, pruned_loss=0.06231, over 4264527.55 frames. ], batch size: 351, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:43:39,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2110446.0, ans=0.1 2023-06-28 17:44:47,671 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.729e+02 7.899e+02 1.103e+03 1.681e+03 3.393e+03, threshold=2.206e+03, percent-clipped=5.0 2023-06-28 17:45:01,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2110686.0, ans=0.125 2023-06-28 17:45:06,100 INFO [train.py:996] (2/4) Epoch 12, batch 16350, loss[loss=0.2157, simple_loss=0.2966, pruned_loss=0.06736, over 21701.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2839, pruned_loss=0.06174, over 4264939.93 frames. 
], batch size: 298, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:45:41,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2110806.0, ans=0.0 2023-06-28 17:45:50,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2110866.0, ans=0.125 2023-06-28 17:46:01,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2110866.0, ans=0.125 2023-06-28 17:46:23,563 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.45 vs. limit=15.0 2023-06-28 17:46:28,844 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.14 vs. limit=15.0 2023-06-28 17:46:53,876 INFO [train.py:996] (2/4) Epoch 12, batch 16400, loss[loss=0.2281, simple_loss=0.2927, pruned_loss=0.08178, over 21558.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2881, pruned_loss=0.06413, over 4270533.62 frames. ], batch size: 548, lr: 2.41e-03, grad_scale: 32.0 2023-06-28 17:46:58,373 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.99 vs. limit=15.0 2023-06-28 17:47:52,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2111226.0, ans=0.0 2023-06-28 17:48:02,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=2111226.0, ans=15.0 2023-06-28 17:48:16,144 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.551e+02 7.002e+02 9.291e+02 1.321e+03 2.557e+03, threshold=1.858e+03, percent-clipped=4.0 2023-06-28 17:48:23,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2111286.0, ans=0.125 2023-06-28 17:48:37,412 INFO [train.py:996] (2/4) Epoch 12, batch 16450, loss[loss=0.1936, simple_loss=0.2773, pruned_loss=0.05492, over 21509.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2894, pruned_loss=0.06623, over 4278357.62 frames. ], batch size: 211, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:49:01,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2111406.0, ans=0.0 2023-06-28 17:50:15,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2111586.0, ans=0.1 2023-06-28 17:50:20,643 INFO [train.py:996] (2/4) Epoch 12, batch 16500, loss[loss=0.2021, simple_loss=0.3066, pruned_loss=0.04884, over 20809.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2886, pruned_loss=0.06662, over 4278190.93 frames. ], batch size: 608, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:51:52,099 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.183e+02 7.579e+02 1.164e+03 1.772e+03 4.926e+03, threshold=2.328e+03, percent-clipped=21.0 2023-06-28 17:51:59,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2111886.0, ans=0.1 2023-06-28 17:52:09,285 INFO [train.py:996] (2/4) Epoch 12, batch 16550, loss[loss=0.2306, simple_loss=0.3166, pruned_loss=0.07227, over 21652.00 frames. 
], tot_loss[loss=0.2067, simple_loss=0.286, pruned_loss=0.06373, over 4271866.41 frames. ], batch size: 389, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:52:23,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2111946.0, ans=0.125 2023-06-28 17:52:57,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2112066.0, ans=0.0 2023-06-28 17:53:22,739 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.57 vs. limit=15.0 2023-06-28 17:53:49,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2112186.0, ans=0.0 2023-06-28 17:53:52,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2112186.0, ans=0.0 2023-06-28 17:53:54,960 INFO [train.py:996] (2/4) Epoch 12, batch 16600, loss[loss=0.2407, simple_loss=0.3468, pruned_loss=0.06729, over 21923.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2934, pruned_loss=0.06623, over 4265509.31 frames. ], batch size: 317, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:54:37,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2112366.0, ans=0.125 2023-06-28 17:55:04,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2112426.0, ans=0.125 2023-06-28 17:55:27,796 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.049e+02 7.685e+02 9.523e+02 1.400e+03 3.440e+03, threshold=1.905e+03, percent-clipped=5.0 2023-06-28 17:55:40,101 INFO [train.py:996] (2/4) Epoch 12, batch 16650, loss[loss=0.2377, simple_loss=0.3223, pruned_loss=0.07656, over 21477.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3033, pruned_loss=0.06905, over 4260906.67 frames. ], batch size: 211, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:55:42,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2112546.0, ans=0.125 2023-06-28 17:55:48,697 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.65 vs. limit=22.5 2023-06-28 17:55:57,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2112546.0, ans=0.0 2023-06-28 17:57:35,443 INFO [train.py:996] (2/4) Epoch 12, batch 16700, loss[loss=0.2734, simple_loss=0.3597, pruned_loss=0.09357, over 21514.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3057, pruned_loss=0.07028, over 4256404.86 frames. 
], batch size: 471, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:57:44,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2112846.0, ans=0.2 2023-06-28 17:57:57,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2112906.0, ans=0.0 2023-06-28 17:58:08,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2112906.0, ans=0.125 2023-06-28 17:59:06,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2113086.0, ans=0.1 2023-06-28 17:59:08,945 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.823e+02 8.945e+02 1.338e+03 1.942e+03 4.278e+03, threshold=2.675e+03, percent-clipped=28.0 2023-06-28 17:59:26,656 INFO [train.py:996] (2/4) Epoch 12, batch 16750, loss[loss=0.2211, simple_loss=0.3134, pruned_loss=0.06443, over 19987.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3072, pruned_loss=0.07208, over 4254460.70 frames. ], batch size: 704, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:00:04,346 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 18:00:29,474 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-28 18:00:48,127 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-28 18:01:05,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=2113386.0, ans=10.0 2023-06-28 18:01:11,609 INFO [train.py:996] (2/4) Epoch 12, batch 16800, loss[loss=0.2579, simple_loss=0.333, pruned_loss=0.09136, over 21619.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3104, pruned_loss=0.07215, over 4254198.00 frames. ], batch size: 471, lr: 2.41e-03, grad_scale: 32.0 2023-06-28 18:02:18,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2113626.0, ans=0.125 2023-06-28 18:02:44,389 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.630e+02 9.200e+02 1.390e+03 2.563e+03 4.897e+03, threshold=2.780e+03, percent-clipped=19.0 2023-06-28 18:02:57,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2113746.0, ans=0.125 2023-06-28 18:02:58,984 INFO [train.py:996] (2/4) Epoch 12, batch 16850, loss[loss=0.2151, simple_loss=0.2943, pruned_loss=0.06801, over 21485.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3069, pruned_loss=0.07205, over 4261117.17 frames. ], batch size: 131, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:03:14,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2113806.0, ans=0.125 2023-06-28 18:03:56,359 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=13.44 vs. 
limit=12.0 2023-06-28 18:04:36,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2113986.0, ans=0.2 2023-06-28 18:04:40,770 INFO [train.py:996] (2/4) Epoch 12, batch 16900, loss[loss=0.1845, simple_loss=0.2576, pruned_loss=0.05575, over 21840.00 frames. ], tot_loss[loss=0.221, simple_loss=0.3013, pruned_loss=0.07038, over 4262833.71 frames. ], batch size: 102, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:04:48,491 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.88 vs. limit=22.5 2023-06-28 18:04:54,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2114046.0, ans=0.125 2023-06-28 18:05:00,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2114106.0, ans=0.0 2023-06-28 18:05:15,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2114166.0, ans=0.125 2023-06-28 18:05:21,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2114166.0, ans=0.125 2023-06-28 18:05:29,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2114166.0, ans=0.125 2023-06-28 18:06:05,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2114286.0, ans=0.0 2023-06-28 18:06:08,632 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.231e+02 8.316e+02 1.157e+03 1.734e+03 4.199e+03, threshold=2.313e+03, percent-clipped=8.0 2023-06-28 18:06:12,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2114286.0, ans=0.125 2023-06-28 18:06:21,751 INFO [train.py:996] (2/4) Epoch 12, batch 16950, loss[loss=0.1966, simple_loss=0.2665, pruned_loss=0.06336, over 21860.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2935, pruned_loss=0.06836, over 4267780.25 frames. ], batch size: 98, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:06:26,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2114346.0, ans=0.1 2023-06-28 18:07:45,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2114586.0, ans=0.1 2023-06-28 18:07:49,736 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.67 vs. limit=22.5 2023-06-28 18:07:58,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2114646.0, ans=0.95 2023-06-28 18:07:59,335 INFO [train.py:996] (2/4) Epoch 12, batch 17000, loss[loss=0.1857, simple_loss=0.2484, pruned_loss=0.06151, over 21243.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2898, pruned_loss=0.06806, over 4273389.66 frames. 
], batch size: 608, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:08:00,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=2114646.0, ans=22.5 2023-06-28 18:08:06,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2114646.0, ans=0.2 2023-06-28 18:08:22,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2114706.0, ans=0.0 2023-06-28 18:08:45,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2114766.0, ans=0.125 2023-06-28 18:09:29,803 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.045e+02 1.097e+03 1.381e+03 1.822e+03 3.953e+03, threshold=2.762e+03, percent-clipped=12.0 2023-06-28 18:09:41,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2114946.0, ans=0.125 2023-06-28 18:09:42,677 INFO [train.py:996] (2/4) Epoch 12, batch 17050, loss[loss=0.2548, simple_loss=0.3348, pruned_loss=0.08737, over 21746.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2966, pruned_loss=0.07019, over 4281060.53 frames. ], batch size: 441, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:11:18,383 INFO [train.py:996] (2/4) Epoch 12, batch 17100, loss[loss=0.1962, simple_loss=0.2738, pruned_loss=0.05935, over 21830.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2961, pruned_loss=0.07112, over 4282982.99 frames. ], batch size: 282, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:11:50,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2115306.0, ans=0.125 2023-06-28 18:12:24,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2115426.0, ans=0.2 2023-06-28 18:12:34,179 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.79 vs. limit=15.0 2023-06-28 18:12:35,623 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. limit=6.0 2023-06-28 18:12:52,866 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.623e+02 7.702e+02 1.047e+03 1.626e+03 3.499e+03, threshold=2.095e+03, percent-clipped=2.0 2023-06-28 18:13:01,298 INFO [train.py:996] (2/4) Epoch 12, batch 17150, loss[loss=0.1863, simple_loss=0.2768, pruned_loss=0.04792, over 21701.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2927, pruned_loss=0.07088, over 4287408.03 frames. ], batch size: 389, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:13:39,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2115606.0, ans=0.125 2023-06-28 18:13:51,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2115666.0, ans=0.2 2023-06-28 18:14:44,911 INFO [train.py:996] (2/4) Epoch 12, batch 17200, loss[loss=0.229, simple_loss=0.3004, pruned_loss=0.07879, over 21329.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2926, pruned_loss=0.0708, over 4287022.92 frames. 
], batch size: 176, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:15:22,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2115906.0, ans=0.0 2023-06-28 18:16:10,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2116086.0, ans=0.0 2023-06-28 18:16:20,220 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.974e+02 7.324e+02 9.389e+02 1.283e+03 2.769e+03, threshold=1.878e+03, percent-clipped=7.0 2023-06-28 18:16:33,070 INFO [train.py:996] (2/4) Epoch 12, batch 17250, loss[loss=0.1963, simple_loss=0.2857, pruned_loss=0.0535, over 21856.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2966, pruned_loss=0.07218, over 4287125.89 frames. ], batch size: 282, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:16:35,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2116146.0, ans=0.0 2023-06-28 18:16:44,283 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.82 vs. limit=10.0 2023-06-28 18:17:01,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2116206.0, ans=0.0 2023-06-28 18:17:32,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=2116326.0, ans=0.95 2023-06-28 18:18:15,713 INFO [train.py:996] (2/4) Epoch 12, batch 17300, loss[loss=0.2625, simple_loss=0.3397, pruned_loss=0.09265, over 21308.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3041, pruned_loss=0.0747, over 4281368.50 frames. ], batch size: 143, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:18:17,150 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.09 vs. limit=10.0 2023-06-28 18:18:40,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2116506.0, ans=0.0 2023-06-28 18:19:01,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2116566.0, ans=0.0 2023-06-28 18:19:48,139 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.078e+02 8.589e+02 1.215e+03 1.645e+03 3.725e+03, threshold=2.430e+03, percent-clipped=16.0 2023-06-28 18:19:59,797 INFO [train.py:996] (2/4) Epoch 12, batch 17350, loss[loss=0.2499, simple_loss=0.3428, pruned_loss=0.07854, over 21510.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3067, pruned_loss=0.0743, over 4274532.21 frames. ], batch size: 471, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:20:39,378 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.14 vs. limit=22.5 2023-06-28 18:20:53,176 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.45 vs. 
limit=6.0 2023-06-28 18:21:13,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2116926.0, ans=0.2 2023-06-28 18:21:38,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2116986.0, ans=0.0 2023-06-28 18:21:42,602 INFO [train.py:996] (2/4) Epoch 12, batch 17400, loss[loss=0.1899, simple_loss=0.2778, pruned_loss=0.05094, over 21644.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3024, pruned_loss=0.07137, over 4271147.71 frames. ], batch size: 247, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:21:50,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2117046.0, ans=0.125 2023-06-28 18:22:11,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2117106.0, ans=0.2 2023-06-28 18:22:49,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2117226.0, ans=0.125 2023-06-28 18:23:04,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2117286.0, ans=0.125 2023-06-28 18:23:13,921 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.979e+02 8.447e+02 1.378e+03 1.932e+03 4.918e+03, threshold=2.756e+03, percent-clipped=14.0 2023-06-28 18:23:20,614 INFO [train.py:996] (2/4) Epoch 12, batch 17450, loss[loss=0.2496, simple_loss=0.3259, pruned_loss=0.0867, over 21536.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2975, pruned_loss=0.06853, over 4273246.42 frames. ], batch size: 508, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:23:33,052 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.22 vs. limit=10.0 2023-06-28 18:23:40,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2117406.0, ans=0.2 2023-06-28 18:24:33,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2117526.0, ans=0.2 2023-06-28 18:24:56,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2117646.0, ans=0.125 2023-06-28 18:24:57,150 INFO [train.py:996] (2/4) Epoch 12, batch 17500, loss[loss=0.2022, simple_loss=0.2753, pruned_loss=0.06454, over 21808.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2932, pruned_loss=0.06692, over 4280797.44 frames. ], batch size: 282, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:24:59,819 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-28 18:25:15,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2117646.0, ans=0.125 2023-06-28 18:25:17,798 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.16 vs. 
limit=6.0 2023-06-28 18:25:49,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2117766.0, ans=0.07 2023-06-28 18:26:06,198 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.12 vs. limit=5.0 2023-06-28 18:26:16,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2117826.0, ans=0.125 2023-06-28 18:26:30,506 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.515e+02 7.082e+02 9.304e+02 1.343e+03 2.877e+03, threshold=1.861e+03, percent-clipped=1.0 2023-06-28 18:26:31,662 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-28 18:26:36,965 INFO [train.py:996] (2/4) Epoch 12, batch 17550, loss[loss=0.1945, simple_loss=0.2914, pruned_loss=0.04878, over 21795.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2937, pruned_loss=0.06578, over 4270786.77 frames. ], batch size: 316, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:26:37,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2117946.0, ans=0.2 2023-06-28 18:27:01,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2118006.0, ans=0.1 2023-06-28 18:27:08,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2118006.0, ans=0.125 2023-06-28 18:27:33,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2118066.0, ans=0.0 2023-06-28 18:27:50,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2118126.0, ans=0.125 2023-06-28 18:28:18,276 INFO [train.py:996] (2/4) Epoch 12, batch 17600, loss[loss=0.2247, simple_loss=0.3053, pruned_loss=0.07207, over 21272.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2969, pruned_loss=0.06664, over 4264762.86 frames. ], batch size: 159, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:28:46,357 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.36 vs. limit=15.0 2023-06-28 18:29:16,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2118366.0, ans=0.125 2023-06-28 18:29:48,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2118486.0, ans=0.2 2023-06-28 18:29:51,277 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.265e+02 8.846e+02 1.006e+03 1.368e+03 3.785e+03, threshold=2.012e+03, percent-clipped=6.0 2023-06-28 18:30:03,181 INFO [train.py:996] (2/4) Epoch 12, batch 17650, loss[loss=0.2289, simple_loss=0.3104, pruned_loss=0.07371, over 21541.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2959, pruned_loss=0.06722, over 4266692.73 frames. 
], batch size: 473, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:30:14,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2118546.0, ans=0.1 2023-06-28 18:30:19,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2118606.0, ans=0.125 2023-06-28 18:30:24,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2118606.0, ans=10.0 2023-06-28 18:31:01,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2118666.0, ans=0.125 2023-06-28 18:31:11,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2118726.0, ans=0.1 2023-06-28 18:31:43,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2118786.0, ans=0.125 2023-06-28 18:31:46,606 INFO [train.py:996] (2/4) Epoch 12, batch 17700, loss[loss=0.193, simple_loss=0.2705, pruned_loss=0.05775, over 20150.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2904, pruned_loss=0.06534, over 4258627.84 frames. ], batch size: 703, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:31:55,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2118846.0, ans=0.125 2023-06-28 18:32:02,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2118906.0, ans=0.2 2023-06-28 18:32:38,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2118966.0, ans=0.1 2023-06-28 18:33:10,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2119086.0, ans=0.125 2023-06-28 18:33:19,194 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.445e+02 8.687e+02 1.297e+03 2.273e+03 4.187e+03, threshold=2.595e+03, percent-clipped=29.0 2023-06-28 18:33:19,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2119086.0, ans=0.2 2023-06-28 18:33:26,141 INFO [train.py:996] (2/4) Epoch 12, batch 17750, loss[loss=0.2411, simple_loss=0.3229, pruned_loss=0.07959, over 21380.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2979, pruned_loss=0.06794, over 4260923.40 frames. ], batch size: 549, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:33:59,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2119206.0, ans=0.0 2023-06-28 18:34:09,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2119206.0, ans=0.125 2023-06-28 18:35:20,466 INFO [train.py:996] (2/4) Epoch 12, batch 17800, loss[loss=0.1928, simple_loss=0.2601, pruned_loss=0.06272, over 21173.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2964, pruned_loss=0.06681, over 4259872.00 frames. 
], batch size: 143, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:35:38,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2119506.0, ans=0.125 2023-06-28 18:35:43,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2119506.0, ans=0.0 2023-06-28 18:35:58,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2119566.0, ans=0.125 2023-06-28 18:36:52,582 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.029e+02 8.129e+02 1.136e+03 1.993e+03 4.758e+03, threshold=2.272e+03, percent-clipped=17.0 2023-06-28 18:36:59,641 INFO [train.py:996] (2/4) Epoch 12, batch 17850, loss[loss=0.2473, simple_loss=0.3214, pruned_loss=0.08662, over 21732.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2976, pruned_loss=0.06756, over 4266940.72 frames. ], batch size: 441, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:38:04,308 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.63 vs. limit=10.0 2023-06-28 18:38:27,834 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.42 vs. limit=22.5 2023-06-28 18:38:28,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2119986.0, ans=0.125 2023-06-28 18:38:40,366 INFO [train.py:996] (2/4) Epoch 12, batch 17900, loss[loss=0.2144, simple_loss=0.2981, pruned_loss=0.06539, over 21248.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.3015, pruned_loss=0.0688, over 4269968.05 frames. ], batch size: 159, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:39:06,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2120106.0, ans=0.125 2023-06-28 18:39:13,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2120106.0, ans=0.125 2023-06-28 18:40:12,406 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.160e+02 9.224e+02 1.391e+03 2.083e+03 4.254e+03, threshold=2.783e+03, percent-clipped=21.0 2023-06-28 18:40:13,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2120286.0, ans=0.125 2023-06-28 18:40:13,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2120286.0, ans=0.0 2023-06-28 18:40:19,128 INFO [train.py:996] (2/4) Epoch 12, batch 17950, loss[loss=0.1652, simple_loss=0.2406, pruned_loss=0.04491, over 21124.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.3002, pruned_loss=0.06562, over 4269469.58 frames. ], batch size: 143, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:40:24,476 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 18:41:34,553 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.16 vs. limit=15.0 2023-06-28 18:41:56,543 INFO [train.py:996] (2/4) Epoch 12, batch 18000, loss[loss=0.1744, simple_loss=0.2485, pruned_loss=0.0502, over 21785.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2948, pruned_loss=0.06404, over 4257894.22 frames. 
], batch size: 317, lr: 2.41e-03, grad_scale: 32.0 2023-06-28 18:41:56,543 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-28 18:42:08,113 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([3.9701, 1.9944, 3.3824, 2.1671], device='cuda:2') 2023-06-28 18:42:13,349 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.1225, 3.3424, 3.0991, 2.1829], device='cuda:2') 2023-06-28 18:42:16,413 INFO [train.py:1028] (2/4) Epoch 12, validation: loss=0.2604, simple_loss=0.3527, pruned_loss=0.08401, over 1796401.00 frames. 2023-06-28 18:42:16,414 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23793MB 2023-06-28 18:42:21,083 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=22.5 2023-06-28 18:42:30,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2120646.0, ans=0.1 2023-06-28 18:43:07,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2120766.0, ans=0.2 2023-06-28 18:43:19,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2120766.0, ans=0.0 2023-06-28 18:43:30,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2120826.0, ans=0.125 2023-06-28 18:43:55,006 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.231e+02 7.241e+02 9.176e+02 1.211e+03 3.223e+03, threshold=1.835e+03, percent-clipped=1.0 2023-06-28 18:44:00,008 INFO [train.py:996] (2/4) Epoch 12, batch 18050, loss[loss=0.1907, simple_loss=0.2671, pruned_loss=0.05718, over 21622.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2895, pruned_loss=0.06346, over 4257798.96 frames. ], batch size: 247, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:44:02,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2120946.0, ans=0.1 2023-06-28 18:44:50,144 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.02 vs. limit=15.0 2023-06-28 18:45:15,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2121126.0, ans=0.0 2023-06-28 18:45:23,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2121126.0, ans=0.125 2023-06-28 18:45:44,367 INFO [train.py:996] (2/4) Epoch 12, batch 18100, loss[loss=0.2342, simple_loss=0.317, pruned_loss=0.07567, over 21591.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2924, pruned_loss=0.06542, over 4256643.72 frames. 
], batch size: 112, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:46:43,067 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 18:46:49,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2121426.0, ans=0.125 2023-06-28 18:46:49,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2121426.0, ans=0.2 2023-06-28 18:47:07,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2121426.0, ans=0.125 2023-06-28 18:47:12,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2121486.0, ans=0.0 2023-06-28 18:47:23,005 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.434e+02 8.761e+02 1.193e+03 1.712e+03 3.705e+03, threshold=2.386e+03, percent-clipped=21.0 2023-06-28 18:47:26,567 INFO [train.py:996] (2/4) Epoch 12, batch 18150, loss[loss=0.1941, simple_loss=0.2586, pruned_loss=0.06486, over 21813.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2957, pruned_loss=0.06536, over 4255937.12 frames. ], batch size: 112, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:47:27,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2121546.0, ans=0.0 2023-06-28 18:48:48,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2121726.0, ans=0.125 2023-06-28 18:48:55,451 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=22.5 2023-06-28 18:48:58,775 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.31 vs. limit=15.0 2023-06-28 18:49:01,830 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.60 vs. limit=10.0 2023-06-28 18:49:07,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2121846.0, ans=0.025 2023-06-28 18:49:08,825 INFO [train.py:996] (2/4) Epoch 12, batch 18200, loss[loss=0.1884, simple_loss=0.2648, pruned_loss=0.05599, over 21807.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2899, pruned_loss=0.06522, over 4255813.74 frames. ], batch size: 102, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:49:13,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=2121846.0, ans=10.0 2023-06-28 18:49:24,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2121846.0, ans=0.125 2023-06-28 18:49:45,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2121906.0, ans=0.125 2023-06-28 18:49:48,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2121906.0, ans=0.2 2023-06-28 18:49:50,820 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.80 vs. 
limit=15.0 2023-06-28 18:49:56,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2121966.0, ans=0.0 2023-06-28 18:50:11,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2122026.0, ans=0.0 2023-06-28 18:50:12,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.whiten.whitening_limit, batch_count=2122026.0, ans=12.0 2023-06-28 18:50:13,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2122026.0, ans=0.0 2023-06-28 18:50:14,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2122026.0, ans=0.0 2023-06-28 18:50:21,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2122026.0, ans=0.125 2023-06-28 18:50:34,294 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.88 vs. limit=22.5 2023-06-28 18:50:44,438 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.871e+02 6.470e+02 8.191e+02 1.481e+03 3.644e+03, threshold=1.638e+03, percent-clipped=8.0 2023-06-28 18:50:48,151 INFO [train.py:996] (2/4) Epoch 12, batch 18250, loss[loss=0.1699, simple_loss=0.2533, pruned_loss=0.04322, over 21859.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2825, pruned_loss=0.06311, over 4261593.82 frames. ], batch size: 124, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:50:53,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2122146.0, ans=0.5 2023-06-28 18:50:58,603 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 18:51:35,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2122266.0, ans=0.125 2023-06-28 18:51:38,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2122266.0, ans=0.2 2023-06-28 18:51:39,418 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=22.5 2023-06-28 18:51:44,748 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=22.5 2023-06-28 18:52:09,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2122386.0, ans=0.0 2023-06-28 18:52:19,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2122386.0, ans=10.0 2023-06-28 18:52:25,409 INFO [train.py:996] (2/4) Epoch 12, batch 18300, loss[loss=0.1943, simple_loss=0.2604, pruned_loss=0.06407, over 21300.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2822, pruned_loss=0.06246, over 4268661.70 frames. ], batch size: 176, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:53:20,842 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.60 vs. 
limit=10.0 2023-06-28 18:53:39,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2122626.0, ans=0.0 2023-06-28 18:53:47,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2122686.0, ans=0.0 2023-06-28 18:54:03,761 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.548e+02 1.033e+03 1.487e+03 2.193e+03 4.357e+03, threshold=2.975e+03, percent-clipped=43.0 2023-06-28 18:54:06,772 INFO [train.py:996] (2/4) Epoch 12, batch 18350, loss[loss=0.1868, simple_loss=0.2636, pruned_loss=0.05501, over 21839.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2864, pruned_loss=0.06235, over 4268956.39 frames. ], batch size: 118, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:55:00,772 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.93 vs. limit=15.0 2023-06-28 18:55:23,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2122926.0, ans=0.0 2023-06-28 18:55:40,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2122986.0, ans=0.0 2023-06-28 18:55:49,919 INFO [train.py:996] (2/4) Epoch 12, batch 18400, loss[loss=0.1676, simple_loss=0.2522, pruned_loss=0.04149, over 21212.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2825, pruned_loss=0.06078, over 4272535.45 frames. ], batch size: 176, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:56:12,855 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.11 vs. limit=6.0 2023-06-28 18:56:38,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2123166.0, ans=0.1 2023-06-28 18:57:01,647 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=22.5 2023-06-28 18:57:18,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2123286.0, ans=0.2 2023-06-28 18:57:19,064 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-06-28 18:57:22,751 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.644e+02 6.567e+02 9.671e+02 1.816e+03 3.680e+03, threshold=1.934e+03, percent-clipped=2.0 2023-06-28 18:57:26,095 INFO [train.py:996] (2/4) Epoch 12, batch 18450, loss[loss=0.1759, simple_loss=0.2497, pruned_loss=0.05104, over 21690.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2806, pruned_loss=0.0585, over 4267867.67 frames. ], batch size: 298, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:57:39,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2123346.0, ans=0.1 2023-06-28 18:58:25,962 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.57 vs. limit=6.0 2023-06-28 18:59:07,202 INFO [train.py:996] (2/4) Epoch 12, batch 18500, loss[loss=0.1736, simple_loss=0.2445, pruned_loss=0.05135, over 21811.00 frames. 
], tot_loss[loss=0.1963, simple_loss=0.2772, pruned_loss=0.05773, over 4260491.77 frames. ], batch size: 124, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:59:11,651 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.49 vs. limit=6.0 2023-06-28 18:59:51,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2123706.0, ans=0.125 2023-06-28 18:59:56,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2123766.0, ans=0.125 2023-06-28 19:00:10,544 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.35 vs. limit=15.0 2023-06-28 19:00:45,281 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.695e+02 8.087e+02 1.310e+03 2.007e+03 4.820e+03, threshold=2.620e+03, percent-clipped=25.0 2023-06-28 19:00:48,726 INFO [train.py:996] (2/4) Epoch 12, batch 18550, loss[loss=0.1899, simple_loss=0.2607, pruned_loss=0.05955, over 21361.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2758, pruned_loss=0.05747, over 4254490.44 frames. ], batch size: 194, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:00:52,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2123946.0, ans=0.125 2023-06-28 19:01:55,254 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=12.0 2023-06-28 19:02:23,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2124186.0, ans=10.0 2023-06-28 19:02:28,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2124186.0, ans=0.0 2023-06-28 19:02:32,390 INFO [train.py:996] (2/4) Epoch 12, batch 18600, loss[loss=0.2286, simple_loss=0.3218, pruned_loss=0.06768, over 21596.00 frames. ], tot_loss[loss=0.1953, simple_loss=0.2741, pruned_loss=0.05822, over 4256679.34 frames. ], batch size: 442, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:03:18,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2124366.0, ans=0.2 2023-06-28 19:03:32,627 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0 2023-06-28 19:03:58,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2124486.0, ans=0.1 2023-06-28 19:04:04,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2124486.0, ans=0.2 2023-06-28 19:04:12,043 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.057e+02 7.816e+02 1.103e+03 1.650e+03 3.069e+03, threshold=2.205e+03, percent-clipped=1.0 2023-06-28 19:04:13,777 INFO [train.py:996] (2/4) Epoch 12, batch 18650, loss[loss=0.2128, simple_loss=0.2824, pruned_loss=0.07162, over 21575.00 frames. ], tot_loss[loss=0.1951, simple_loss=0.2738, pruned_loss=0.05823, over 4264298.37 frames. 
], batch size: 391, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 19:04:20,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2124546.0, ans=0.2 2023-06-28 19:04:36,504 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.62 vs. limit=8.0 2023-06-28 19:05:13,817 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.59 vs. limit=15.0 2023-06-28 19:05:55,210 INFO [train.py:996] (2/4) Epoch 12, batch 18700, loss[loss=0.2089, simple_loss=0.2791, pruned_loss=0.06933, over 21873.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2707, pruned_loss=0.0596, over 4264111.34 frames. ], batch size: 371, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 19:06:03,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2124846.0, ans=0.1 2023-06-28 19:07:15,484 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-06-28 19:07:35,705 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.958e+02 6.838e+02 8.648e+02 1.289e+03 2.694e+03, threshold=1.730e+03, percent-clipped=5.0 2023-06-28 19:07:37,315 INFO [train.py:996] (2/4) Epoch 12, batch 18750, loss[loss=0.2012, simple_loss=0.274, pruned_loss=0.06424, over 21458.00 frames. ], tot_loss[loss=0.1963, simple_loss=0.2705, pruned_loss=0.06103, over 4266932.48 frames. ], batch size: 131, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 19:07:49,948 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.91 vs. limit=6.0 2023-06-28 19:07:52,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2125206.0, ans=0.0 2023-06-28 19:08:07,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2125206.0, ans=0.04949747468305833 2023-06-28 19:08:45,825 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=22.5 2023-06-28 19:09:19,262 INFO [train.py:996] (2/4) Epoch 12, batch 18800, loss[loss=0.2291, simple_loss=0.2938, pruned_loss=0.08224, over 21215.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2779, pruned_loss=0.06266, over 4251674.73 frames. ], batch size: 143, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:09:24,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2125446.0, ans=0.0 2023-06-28 19:10:40,745 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.89 vs. limit=6.0 2023-06-28 19:10:58,898 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.378e+02 7.621e+02 1.255e+03 1.956e+03 3.877e+03, threshold=2.510e+03, percent-clipped=29.0 2023-06-28 19:11:00,575 INFO [train.py:996] (2/4) Epoch 12, batch 18850, loss[loss=0.1904, simple_loss=0.2696, pruned_loss=0.05558, over 21803.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2764, pruned_loss=0.05938, over 4257369.08 frames. 
], batch size: 118, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:11:10,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2125746.0, ans=0.0 2023-06-28 19:11:24,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2125806.0, ans=0.2 2023-06-28 19:12:23,653 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-06-28 19:12:37,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2125986.0, ans=0.125 2023-06-28 19:12:40,424 INFO [train.py:996] (2/4) Epoch 12, batch 18900, loss[loss=0.1648, simple_loss=0.2207, pruned_loss=0.05449, over 20777.00 frames. ], tot_loss[loss=0.1945, simple_loss=0.2719, pruned_loss=0.05859, over 4264848.64 frames. ], batch size: 609, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:12:45,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2126046.0, ans=0.125 2023-06-28 19:12:49,418 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=15.0 2023-06-28 19:12:55,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2126106.0, ans=0.2 2023-06-28 19:13:00,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=2126106.0, ans=0.1 2023-06-28 19:13:00,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2126106.0, ans=0.1 2023-06-28 19:14:14,607 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.13 vs. limit=22.5 2023-06-28 19:14:14,964 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.285e+02 7.566e+02 1.259e+03 1.840e+03 2.966e+03, threshold=2.518e+03, percent-clipped=3.0 2023-06-28 19:14:16,567 INFO [train.py:996] (2/4) Epoch 12, batch 18950, loss[loss=0.1959, simple_loss=0.2667, pruned_loss=0.06255, over 21445.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2734, pruned_loss=0.06058, over 4279548.17 frames. ], batch size: 194, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:14:27,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2126346.0, ans=0.1 2023-06-28 19:14:35,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2126406.0, ans=0.2 2023-06-28 19:14:43,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2126406.0, ans=0.125 2023-06-28 19:14:51,281 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.57 vs. 
limit=15.0 2023-06-28 19:14:52,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2126406.0, ans=0.015 2023-06-28 19:15:37,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2126586.0, ans=0.2 2023-06-28 19:15:40,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2126586.0, ans=0.125 2023-06-28 19:15:54,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2126646.0, ans=0.0 2023-06-28 19:15:55,095 INFO [train.py:996] (2/4) Epoch 12, batch 19000, loss[loss=0.289, simple_loss=0.3513, pruned_loss=0.1133, over 21414.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2834, pruned_loss=0.06306, over 4283910.51 frames. ], batch size: 471, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:16:10,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2126646.0, ans=0.04949747468305833 2023-06-28 19:16:42,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2126766.0, ans=0.0 2023-06-28 19:17:32,191 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.360e+02 7.287e+02 9.721e+02 1.319e+03 3.703e+03, threshold=1.944e+03, percent-clipped=9.0 2023-06-28 19:17:33,788 INFO [train.py:996] (2/4) Epoch 12, batch 19050, loss[loss=0.2289, simple_loss=0.2852, pruned_loss=0.08624, over 20076.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2886, pruned_loss=0.0671, over 4280725.32 frames. ], batch size: 703, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:18:22,616 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.54 vs. limit=6.0 2023-06-28 19:18:49,464 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.76 vs. limit=22.5 2023-06-28 19:18:57,825 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=15.0 2023-06-28 19:19:12,356 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-28 19:19:16,208 INFO [train.py:996] (2/4) Epoch 12, batch 19100, loss[loss=0.1906, simple_loss=0.2473, pruned_loss=0.06695, over 20781.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2862, pruned_loss=0.06786, over 4279895.67 frames. ], batch size: 608, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:20:27,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2127426.0, ans=0.07 2023-06-28 19:21:01,436 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.479e+02 7.973e+02 1.169e+03 1.755e+03 3.524e+03, threshold=2.338e+03, percent-clipped=19.0 2023-06-28 19:21:03,173 INFO [train.py:996] (2/4) Epoch 12, batch 19150, loss[loss=0.2224, simple_loss=0.3199, pruned_loss=0.06241, over 21581.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2883, pruned_loss=0.06841, over 4280170.57 frames. 
], batch size: 230, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:21:35,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2127606.0, ans=0.025 2023-06-28 19:22:04,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2127726.0, ans=0.125 2023-06-28 19:22:53,940 INFO [train.py:996] (2/4) Epoch 12, batch 19200, loss[loss=0.2617, simple_loss=0.3624, pruned_loss=0.08049, over 21682.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.299, pruned_loss=0.06972, over 4280084.69 frames. ], batch size: 389, lr: 2.41e-03, grad_scale: 32.0 2023-06-28 19:23:21,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2127906.0, ans=0.125 2023-06-28 19:23:36,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2127966.0, ans=0.125 2023-06-28 19:23:44,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2127966.0, ans=0.125 2023-06-28 19:24:09,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2128086.0, ans=0.125 2023-06-28 19:24:20,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2128086.0, ans=0.04949747468305833 2023-06-28 19:24:36,188 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.048e+02 8.519e+02 1.165e+03 1.659e+03 4.865e+03, threshold=2.330e+03, percent-clipped=13.0 2023-06-28 19:24:36,224 INFO [train.py:996] (2/4) Epoch 12, batch 19250, loss[loss=0.1927, simple_loss=0.2764, pruned_loss=0.05453, over 21875.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2991, pruned_loss=0.06516, over 4264492.33 frames. ], batch size: 371, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:26:09,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2128386.0, ans=0.0 2023-06-28 19:26:17,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2128446.0, ans=0.04949747468305833 2023-06-28 19:26:18,602 INFO [train.py:996] (2/4) Epoch 12, batch 19300, loss[loss=0.1863, simple_loss=0.2626, pruned_loss=0.05499, over 21255.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2957, pruned_loss=0.06356, over 4266119.54 frames. ], batch size: 176, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:26:34,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2128446.0, ans=0.125 2023-06-28 19:27:53,522 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.85 vs. limit=22.5 2023-06-28 19:27:57,285 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.967e+02 7.718e+02 1.195e+03 1.796e+03 4.248e+03, threshold=2.390e+03, percent-clipped=9.0 2023-06-28 19:27:57,316 INFO [train.py:996] (2/4) Epoch 12, batch 19350, loss[loss=0.16, simple_loss=0.2455, pruned_loss=0.03726, over 21528.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.29, pruned_loss=0.06023, over 4263811.05 frames. 
], batch size: 212, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:28:15,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2128746.0, ans=0.1 2023-06-28 19:28:26,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2128806.0, ans=0.5 2023-06-28 19:28:58,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2128926.0, ans=0.125 2023-06-28 19:29:37,800 INFO [train.py:996] (2/4) Epoch 12, batch 19400, loss[loss=0.2389, simple_loss=0.3127, pruned_loss=0.08252, over 21758.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2875, pruned_loss=0.05979, over 4268944.04 frames. ], batch size: 112, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:29:56,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2129046.0, ans=0.0 2023-06-28 19:30:13,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2129106.0, ans=0.07 2023-06-28 19:30:23,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2129166.0, ans=0.04949747468305833 2023-06-28 19:30:34,540 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 19:31:19,981 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.788e+02 6.972e+02 8.917e+02 1.265e+03 3.232e+03, threshold=1.783e+03, percent-clipped=5.0 2023-06-28 19:31:20,013 INFO [train.py:996] (2/4) Epoch 12, batch 19450, loss[loss=0.1955, simple_loss=0.2517, pruned_loss=0.06963, over 21096.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2855, pruned_loss=0.06177, over 4279622.09 frames. ], batch size: 608, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:31:28,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2129346.0, ans=0.125 2023-06-28 19:31:31,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2129346.0, ans=0.0 2023-06-28 19:32:40,437 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-28 19:33:02,593 INFO [train.py:996] (2/4) Epoch 12, batch 19500, loss[loss=0.1772, simple_loss=0.242, pruned_loss=0.05621, over 21462.00 frames. ], tot_loss[loss=0.203, simple_loss=0.281, pruned_loss=0.06248, over 4284753.21 frames. ], batch size: 195, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 19:34:22,702 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.32 vs. limit=15.0 2023-06-28 19:34:42,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2129946.0, ans=0.0 2023-06-28 19:34:43,718 INFO [train.py:996] (2/4) Epoch 12, batch 19550, loss[loss=0.1858, simple_loss=0.2748, pruned_loss=0.04843, over 21214.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.277, pruned_loss=0.06184, over 4284716.24 frames. 
], batch size: 159, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 19:34:45,242 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.793e+02 7.099e+02 1.131e+03 1.724e+03 3.417e+03, threshold=2.262e+03, percent-clipped=22.0 2023-06-28 19:34:45,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2129946.0, ans=0.1 2023-06-28 19:35:25,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2130066.0, ans=0.125 2023-06-28 19:35:58,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2130186.0, ans=0.05 2023-06-28 19:36:25,926 INFO [train.py:996] (2/4) Epoch 12, batch 19600, loss[loss=0.2181, simple_loss=0.2845, pruned_loss=0.07586, over 21764.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2786, pruned_loss=0.06207, over 4288307.08 frames. ], batch size: 441, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:36:45,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2130246.0, ans=0.0 2023-06-28 19:38:05,715 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.89 vs. limit=6.0 2023-06-28 19:38:14,413 INFO [train.py:996] (2/4) Epoch 12, batch 19650, loss[loss=0.2252, simple_loss=0.2973, pruned_loss=0.07658, over 21857.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2831, pruned_loss=0.06458, over 4287060.44 frames. ], batch size: 351, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:38:16,161 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.322e+02 7.698e+02 1.187e+03 1.875e+03 3.672e+03, threshold=2.374e+03, percent-clipped=11.0 2023-06-28 19:38:37,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2130606.0, ans=0.125 2023-06-28 19:39:06,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=2130666.0, ans=0.025 2023-06-28 19:39:48,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2130786.0, ans=0.125 2023-06-28 19:40:00,307 INFO [train.py:996] (2/4) Epoch 12, batch 19700, loss[loss=0.2034, simple_loss=0.2966, pruned_loss=0.05504, over 21726.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2881, pruned_loss=0.06573, over 4288149.43 frames. ], batch size: 298, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:40:24,272 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.18 vs. limit=15.0 2023-06-28 19:41:06,668 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.54 vs. limit=10.0 2023-06-28 19:41:29,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2131086.0, ans=0.125 2023-06-28 19:41:50,333 INFO [train.py:996] (2/4) Epoch 12, batch 19750, loss[loss=0.3533, simple_loss=0.4317, pruned_loss=0.1374, over 21440.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.298, pruned_loss=0.0672, over 4278037.25 frames. 
], batch size: 507, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:41:51,906 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.980e+02 8.894e+02 1.243e+03 1.861e+03 5.840e+03, threshold=2.486e+03, percent-clipped=14.0 2023-06-28 19:42:17,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2131206.0, ans=0.125 2023-06-28 19:43:31,907 INFO [train.py:996] (2/4) Epoch 12, batch 19800, loss[loss=0.1674, simple_loss=0.2519, pruned_loss=0.04147, over 21771.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2976, pruned_loss=0.06728, over 4285323.45 frames. ], batch size: 298, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:43:49,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2131506.0, ans=0.125 2023-06-28 19:43:50,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2131506.0, ans=10.0 2023-06-28 19:44:13,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2131566.0, ans=0.0 2023-06-28 19:44:53,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2131686.0, ans=0.125 2023-06-28 19:45:08,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2131686.0, ans=0.04949747468305833 2023-06-28 19:45:16,351 INFO [train.py:996] (2/4) Epoch 12, batch 19850, loss[loss=0.169, simple_loss=0.2575, pruned_loss=0.04028, over 21689.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2886, pruned_loss=0.06294, over 4273088.17 frames. ], batch size: 332, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:45:18,108 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.069e+02 7.581e+02 9.843e+02 1.508e+03 3.551e+03, threshold=1.969e+03, percent-clipped=6.0 2023-06-28 19:45:39,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2131806.0, ans=0.2 2023-06-28 19:46:27,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2131926.0, ans=0.2 2023-06-28 19:46:59,320 INFO [train.py:996] (2/4) Epoch 12, batch 19900, loss[loss=0.1735, simple_loss=0.2599, pruned_loss=0.04355, over 21360.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2892, pruned_loss=0.06048, over 4277776.97 frames. ], batch size: 211, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:47:16,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2132106.0, ans=0.0 2023-06-28 19:47:39,892 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=15.0 2023-06-28 19:47:52,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2132166.0, ans=0.125 2023-06-28 19:48:11,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2132226.0, ans=0.0 2023-06-28 19:48:42,867 INFO [train.py:996] (2/4) Epoch 12, batch 19950, loss[loss=0.234, simple_loss=0.3022, pruned_loss=0.08292, over 21402.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2832, pruned_loss=0.05996, over 4267300.00 frames. 
], batch size: 507, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:48:43,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2132346.0, ans=0.1 2023-06-28 19:48:44,573 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.913e+02 9.095e+02 1.320e+03 1.827e+03 2.856e+03, threshold=2.640e+03, percent-clipped=20.0 2023-06-28 19:49:28,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2132406.0, ans=0.1 2023-06-28 19:50:20,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2132586.0, ans=0.125 2023-06-28 19:50:25,837 INFO [train.py:996] (2/4) Epoch 12, batch 20000, loss[loss=0.2111, simple_loss=0.2875, pruned_loss=0.06739, over 21884.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2825, pruned_loss=0.06008, over 4265449.40 frames. ], batch size: 118, lr: 2.40e-03, grad_scale: 32.0 2023-06-28 19:51:06,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2132706.0, ans=0.125 2023-06-28 19:52:06,803 INFO [train.py:996] (2/4) Epoch 12, batch 20050, loss[loss=0.2025, simple_loss=0.278, pruned_loss=0.06347, over 21246.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2851, pruned_loss=0.06242, over 4277275.43 frames. ], batch size: 176, lr: 2.40e-03, grad_scale: 32.0 2023-06-28 19:52:08,365 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.463e+02 7.625e+02 1.079e+03 1.735e+03 4.168e+03, threshold=2.158e+03, percent-clipped=5.0 2023-06-28 19:52:35,853 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.42 vs. limit=22.5 2023-06-28 19:53:00,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2133066.0, ans=0.125 2023-06-28 19:53:38,513 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 19:53:44,639 INFO [train.py:996] (2/4) Epoch 12, batch 20100, loss[loss=0.228, simple_loss=0.3048, pruned_loss=0.07565, over 21745.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2871, pruned_loss=0.06447, over 4287191.62 frames. ], batch size: 389, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:54:02,292 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.92 vs. limit=6.0 2023-06-28 19:54:13,814 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=15.0 2023-06-28 19:54:35,450 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.61 vs. limit=15.0 2023-06-28 19:55:32,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2133486.0, ans=0.125 2023-06-28 19:55:38,347 INFO [train.py:996] (2/4) Epoch 12, batch 20150, loss[loss=0.3077, simple_loss=0.3606, pruned_loss=0.1274, over 21307.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2941, pruned_loss=0.06713, over 4290703.53 frames. 
], batch size: 507, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:55:41,576 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.747e+02 8.369e+02 1.261e+03 1.979e+03 4.381e+03, threshold=2.521e+03, percent-clipped=21.0 2023-06-28 19:56:01,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2133606.0, ans=0.0 2023-06-28 19:57:05,307 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 19:57:23,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2133846.0, ans=0.035 2023-06-28 19:57:24,614 INFO [train.py:996] (2/4) Epoch 12, batch 20200, loss[loss=0.1922, simple_loss=0.2592, pruned_loss=0.06256, over 21837.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.3014, pruned_loss=0.06989, over 4285984.64 frames. ], batch size: 107, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:57:46,943 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.35 vs. limit=15.0 2023-06-28 19:57:56,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2133906.0, ans=0.0 2023-06-28 19:57:58,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2133906.0, ans=0.0 2023-06-28 19:58:38,082 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.80 vs. limit=6.0 2023-06-28 19:58:42,989 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.91 vs. limit=10.0 2023-06-28 19:59:11,835 INFO [train.py:996] (2/4) Epoch 12, batch 20250, loss[loss=0.1994, simple_loss=0.2928, pruned_loss=0.053, over 21822.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.3013, pruned_loss=0.06861, over 4279166.66 frames. ], batch size: 316, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:59:19,724 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.497e+02 8.759e+02 1.398e+03 2.270e+03 4.094e+03, threshold=2.796e+03, percent-clipped=18.0 2023-06-28 19:59:45,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2134206.0, ans=10.0 2023-06-28 19:59:47,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2134206.0, ans=0.04949747468305833 2023-06-28 20:00:27,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2134326.0, ans=0.1 2023-06-28 20:00:29,296 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.33 vs. limit=12.0 2023-06-28 20:00:53,970 INFO [train.py:996] (2/4) Epoch 12, batch 20300, loss[loss=0.2145, simple_loss=0.3084, pruned_loss=0.0603, over 21783.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.3005, pruned_loss=0.06682, over 4283138.63 frames. ], batch size: 371, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:02:34,215 INFO [train.py:996] (2/4) Epoch 12, batch 20350, loss[loss=0.195, simple_loss=0.2794, pruned_loss=0.05532, over 21873.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.3014, pruned_loss=0.06778, over 4287334.69 frames. 
], batch size: 98, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:02:37,269 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.278e+02 8.027e+02 1.220e+03 1.701e+03 2.990e+03, threshold=2.441e+03, percent-clipped=1.0 2023-06-28 20:02:52,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2134746.0, ans=0.1 2023-06-28 20:03:14,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2134866.0, ans=0.125 2023-06-28 20:04:21,622 INFO [train.py:996] (2/4) Epoch 12, batch 20400, loss[loss=0.2368, simple_loss=0.3149, pruned_loss=0.07935, over 21908.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3044, pruned_loss=0.07066, over 4283030.67 frames. ], batch size: 371, lr: 2.40e-03, grad_scale: 32.0 2023-06-28 20:05:08,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2135166.0, ans=0.1 2023-06-28 20:05:18,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2135166.0, ans=0.125 2023-06-28 20:05:19,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2135226.0, ans=0.2 2023-06-28 20:05:31,918 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=15.0 2023-06-28 20:05:34,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2135226.0, ans=0.125 2023-06-28 20:05:34,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2135226.0, ans=0.125 2023-06-28 20:05:56,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2135346.0, ans=0.1 2023-06-28 20:05:58,025 INFO [train.py:996] (2/4) Epoch 12, batch 20450, loss[loss=0.1875, simple_loss=0.2357, pruned_loss=0.06963, over 20219.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3038, pruned_loss=0.07226, over 4259621.12 frames. ], batch size: 703, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:06:03,015 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.519e+02 7.818e+02 1.125e+03 1.970e+03 4.809e+03, threshold=2.251e+03, percent-clipped=13.0 2023-06-28 20:06:26,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2135406.0, ans=0.0 2023-06-28 20:06:36,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2135466.0, ans=0.1 2023-06-28 20:06:50,050 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-28 20:07:39,494 INFO [train.py:996] (2/4) Epoch 12, batch 20500, loss[loss=0.2067, simple_loss=0.2712, pruned_loss=0.07112, over 21692.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2996, pruned_loss=0.0721, over 4261445.24 frames. 
], batch size: 247, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:07:40,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2135646.0, ans=0.1 2023-06-28 20:08:07,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=2135706.0, ans=0.5 2023-06-28 20:08:30,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2135766.0, ans=0.125 2023-06-28 20:09:01,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2135886.0, ans=0.125 2023-06-28 20:09:27,004 INFO [train.py:996] (2/4) Epoch 12, batch 20550, loss[loss=0.2016, simple_loss=0.3008, pruned_loss=0.05123, over 21182.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2924, pruned_loss=0.07003, over 4256796.05 frames. ], batch size: 548, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:09:32,110 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.925e+02 7.744e+02 1.015e+03 1.488e+03 3.056e+03, threshold=2.029e+03, percent-clipped=4.0 2023-06-28 20:11:10,580 INFO [train.py:996] (2/4) Epoch 12, batch 20600, loss[loss=0.2492, simple_loss=0.3193, pruned_loss=0.08959, over 21757.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2935, pruned_loss=0.06791, over 4256156.73 frames. ], batch size: 112, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:11:36,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2136306.0, ans=0.125 2023-06-28 20:11:36,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2136306.0, ans=0.125 2023-06-28 20:12:07,817 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=15.0 2023-06-28 20:12:27,879 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.64 vs. limit=8.0 2023-06-28 20:12:43,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2136486.0, ans=0.0 2023-06-28 20:12:45,957 INFO [train.py:996] (2/4) Epoch 12, batch 20650, loss[loss=0.1845, simple_loss=0.2478, pruned_loss=0.06061, over 21278.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.29, pruned_loss=0.06819, over 4251648.49 frames. ], batch size: 160, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:12:51,205 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.648e+02 9.695e+02 1.455e+03 2.228e+03 5.123e+03, threshold=2.910e+03, percent-clipped=30.0 2023-06-28 20:14:27,984 INFO [train.py:996] (2/4) Epoch 12, batch 20700, loss[loss=0.2381, simple_loss=0.3242, pruned_loss=0.07595, over 21624.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2837, pruned_loss=0.0655, over 4242254.96 frames. 
], batch size: 414, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:14:38,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2136846.0, ans=0.0 2023-06-28 20:14:57,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2136906.0, ans=0.125 2023-06-28 20:15:43,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2137026.0, ans=0.125 2023-06-28 20:16:09,306 INFO [train.py:996] (2/4) Epoch 12, batch 20750, loss[loss=0.2489, simple_loss=0.3752, pruned_loss=0.0613, over 20794.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.286, pruned_loss=0.06492, over 4251583.69 frames. ], batch size: 607, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:16:10,559 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.09 vs. limit=12.0 2023-06-28 20:16:12,139 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.79 vs. limit=10.0 2023-06-28 20:16:13,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2137146.0, ans=0.0 2023-06-28 20:16:14,418 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.341e+02 7.769e+02 1.310e+03 2.249e+03 6.727e+03, threshold=2.619e+03, percent-clipped=13.0 2023-06-28 20:17:29,439 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.49 vs. limit=8.0 2023-06-28 20:17:51,072 INFO [train.py:996] (2/4) Epoch 12, batch 20800, loss[loss=0.1975, simple_loss=0.2639, pruned_loss=0.06555, over 21571.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2896, pruned_loss=0.06632, over 4253816.52 frames. ], batch size: 414, lr: 2.40e-03, grad_scale: 32.0 2023-06-28 20:18:01,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=2137446.0, ans=10.0 2023-06-28 20:19:33,044 INFO [train.py:996] (2/4) Epoch 12, batch 20850, loss[loss=0.1697, simple_loss=0.2408, pruned_loss=0.04928, over 21286.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2825, pruned_loss=0.06508, over 4255958.98 frames. ], batch size: 176, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:19:39,614 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.878e+02 7.517e+02 1.058e+03 1.433e+03 3.063e+03, threshold=2.117e+03, percent-clipped=2.0 2023-06-28 20:19:57,881 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.91 vs. 
limit=22.5 2023-06-28 20:20:07,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2137806.0, ans=0.2 2023-06-28 20:20:21,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2137866.0, ans=0.0 2023-06-28 20:20:28,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2137866.0, ans=0.2 2023-06-28 20:20:33,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2137926.0, ans=0.125 2023-06-28 20:21:10,315 INFO [train.py:996] (2/4) Epoch 12, batch 20900, loss[loss=0.2874, simple_loss=0.3499, pruned_loss=0.1125, over 21654.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2833, pruned_loss=0.06583, over 4264018.58 frames. ], batch size: 509, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:22:04,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2138166.0, ans=0.125 2023-06-28 20:22:23,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2138226.0, ans=0.0 2023-06-28 20:22:48,743 INFO [train.py:996] (2/4) Epoch 12, batch 20950, loss[loss=0.1463, simple_loss=0.2189, pruned_loss=0.03685, over 16512.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2793, pruned_loss=0.06258, over 4246464.21 frames. ], batch size: 63, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:22:55,230 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.730e+02 8.164e+02 1.366e+03 2.074e+03 5.785e+03, threshold=2.733e+03, percent-clipped=24.0 2023-06-28 20:23:04,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2138406.0, ans=0.2 2023-06-28 20:23:24,230 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.21 vs. limit=10.0 2023-06-28 20:23:40,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2138466.0, ans=0.0 2023-06-28 20:23:42,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=2138526.0, ans=6.0 2023-06-28 20:23:45,574 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.60 vs. limit=22.5 2023-06-28 20:24:02,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2138586.0, ans=0.125 2023-06-28 20:24:07,847 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.16 vs. limit=15.0 2023-06-28 20:24:24,231 INFO [train.py:996] (2/4) Epoch 12, batch 21000, loss[loss=0.1621, simple_loss=0.2332, pruned_loss=0.04546, over 18234.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2775, pruned_loss=0.06261, over 4249367.22 frames. ], batch size: 70, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:24:24,231 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-28 20:24:40,722 INFO [train.py:1028] (2/4) Epoch 12, validation: loss=0.2646, simple_loss=0.357, pruned_loss=0.08608, over 1796401.00 frames. 
2023-06-28 20:24:40,723 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23793MB 2023-06-28 20:24:45,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2138646.0, ans=0.2 2023-06-28 20:24:48,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2138646.0, ans=0.04949747468305833 2023-06-28 20:25:18,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2138706.0, ans=0.0 2023-06-28 20:25:33,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2138766.0, ans=0.125 2023-06-28 20:25:57,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2138826.0, ans=0.5 2023-06-28 20:26:15,935 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.27 vs. limit=10.0 2023-06-28 20:26:21,498 INFO [train.py:996] (2/4) Epoch 12, batch 21050, loss[loss=0.2378, simple_loss=0.2808, pruned_loss=0.09736, over 21406.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2766, pruned_loss=0.06292, over 4247459.80 frames. ], batch size: 509, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:26:28,204 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.708e+02 6.795e+02 9.340e+02 1.308e+03 3.165e+03, threshold=1.868e+03, percent-clipped=2.0 2023-06-28 20:27:10,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2139066.0, ans=0.0 2023-06-28 20:27:10,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2139066.0, ans=0.125 2023-06-28 20:27:26,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2139126.0, ans=0.125 2023-06-28 20:27:50,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2139186.0, ans=0.125 2023-06-28 20:28:01,129 INFO [train.py:996] (2/4) Epoch 12, batch 21100, loss[loss=0.1801, simple_loss=0.252, pruned_loss=0.05411, over 21427.00 frames. ], tot_loss[loss=0.199, simple_loss=0.2731, pruned_loss=0.06245, over 4257617.22 frames. ], batch size: 131, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:28:01,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2139246.0, ans=0.125 2023-06-28 20:28:42,608 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.08 vs. limit=15.0 2023-06-28 20:28:47,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2139366.0, ans=0.0 2023-06-28 20:28:52,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2139366.0, ans=0.0 2023-06-28 20:28:57,670 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.13 vs. 
limit=15.0 2023-06-28 20:28:57,674 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. limit=6.0 2023-06-28 20:29:32,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2139486.0, ans=0.2 2023-06-28 20:29:41,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2139546.0, ans=0.125 2023-06-28 20:29:42,305 INFO [train.py:996] (2/4) Epoch 12, batch 21150, loss[loss=0.1776, simple_loss=0.2413, pruned_loss=0.0569, over 21532.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2693, pruned_loss=0.06266, over 4261342.58 frames. ], batch size: 263, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:29:50,626 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.872e+02 8.259e+02 1.205e+03 1.749e+03 3.220e+03, threshold=2.410e+03, percent-clipped=20.0 2023-06-28 20:31:23,260 INFO [train.py:996] (2/4) Epoch 12, batch 21200, loss[loss=0.1923, simple_loss=0.2593, pruned_loss=0.06267, over 21555.00 frames. ], tot_loss[loss=0.1955, simple_loss=0.2679, pruned_loss=0.06154, over 4250081.78 frames. ], batch size: 414, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:31:24,725 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.96 vs. limit=6.0 2023-06-28 20:32:15,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2139966.0, ans=0.0 2023-06-28 20:32:28,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2140026.0, ans=0.125 2023-06-28 20:32:32,290 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=15.0 2023-06-28 20:33:04,755 INFO [train.py:996] (2/4) Epoch 12, batch 21250, loss[loss=0.1914, simple_loss=0.2644, pruned_loss=0.0592, over 21169.00 frames. ], tot_loss[loss=0.1939, simple_loss=0.2654, pruned_loss=0.06118, over 4249761.26 frames. ], batch size: 143, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:33:13,143 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.761e+02 7.355e+02 9.747e+02 1.370e+03 2.666e+03, threshold=1.949e+03, percent-clipped=4.0 2023-06-28 20:33:31,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2140206.0, ans=0.125 2023-06-28 20:33:40,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2140206.0, ans=0.0 2023-06-28 20:33:56,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2140266.0, ans=0.05 2023-06-28 20:34:20,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2140326.0, ans=0.2 2023-06-28 20:34:37,241 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. 
limit=15.0 2023-06-28 20:34:44,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2140386.0, ans=0.1 2023-06-28 20:34:47,107 INFO [train.py:996] (2/4) Epoch 12, batch 21300, loss[loss=0.2102, simple_loss=0.2809, pruned_loss=0.06976, over 21318.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2742, pruned_loss=0.06417, over 4258957.63 frames. ], batch size: 176, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:34:47,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2140446.0, ans=0.0 2023-06-28 20:34:53,599 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.06 vs. limit=10.0 2023-06-28 20:35:23,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2140506.0, ans=0.125 2023-06-28 20:36:17,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2140686.0, ans=0.035 2023-06-28 20:36:25,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2140686.0, ans=0.0 2023-06-28 20:36:30,007 INFO [train.py:996] (2/4) Epoch 12, batch 21350, loss[loss=0.1808, simple_loss=0.2605, pruned_loss=0.05055, over 21722.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2768, pruned_loss=0.06401, over 4262067.36 frames. ], batch size: 112, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:36:43,154 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.041e+02 8.389e+02 1.153e+03 1.810e+03 4.461e+03, threshold=2.306e+03, percent-clipped=20.0 2023-06-28 20:38:00,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2140986.0, ans=0.125 2023-06-28 20:38:16,932 INFO [train.py:996] (2/4) Epoch 12, batch 21400, loss[loss=0.2109, simple_loss=0.292, pruned_loss=0.0649, over 20674.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.28, pruned_loss=0.06366, over 4261385.12 frames. ], batch size: 607, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:38:34,137 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=22.5 2023-06-28 20:38:38,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2141106.0, ans=0.0 2023-06-28 20:38:51,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2141106.0, ans=0.125 2023-06-28 20:39:57,097 INFO [train.py:996] (2/4) Epoch 12, batch 21450, loss[loss=0.2057, simple_loss=0.2755, pruned_loss=0.06797, over 21292.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2824, pruned_loss=0.06469, over 4264951.72 frames. 
], batch size: 159, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:40:04,992 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.076e+02 7.437e+02 1.005e+03 1.722e+03 2.921e+03, threshold=2.009e+03, percent-clipped=6.0 2023-06-28 20:40:13,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2141346.0, ans=0.125 2023-06-28 20:40:15,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2141346.0, ans=0.125 2023-06-28 20:40:28,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2141406.0, ans=0.2 2023-06-28 20:41:38,459 INFO [train.py:996] (2/4) Epoch 12, batch 21500, loss[loss=0.2279, simple_loss=0.2784, pruned_loss=0.08868, over 21314.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2803, pruned_loss=0.06579, over 4262117.66 frames. ], batch size: 473, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:42:15,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2141706.0, ans=0.125 2023-06-28 20:43:04,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2141886.0, ans=0.035 2023-06-28 20:43:05,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2141886.0, ans=0.125 2023-06-28 20:43:19,797 INFO [train.py:996] (2/4) Epoch 12, batch 21550, loss[loss=0.1639, simple_loss=0.2374, pruned_loss=0.0452, over 21793.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2734, pruned_loss=0.06364, over 4252775.73 frames. ], batch size: 317, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:43:32,811 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.830e+02 7.462e+02 9.978e+02 1.500e+03 2.892e+03, threshold=1.996e+03, percent-clipped=12.0 2023-06-28 20:45:02,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2142246.0, ans=0.1 2023-06-28 20:45:02,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2142246.0, ans=0.0 2023-06-28 20:45:03,650 INFO [train.py:996] (2/4) Epoch 12, batch 21600, loss[loss=0.2069, simple_loss=0.2582, pruned_loss=0.07781, over 21342.00 frames. ], tot_loss[loss=0.1967, simple_loss=0.2689, pruned_loss=0.06224, over 4260781.74 frames. ], batch size: 473, lr: 2.40e-03, grad_scale: 32.0 2023-06-28 20:45:50,899 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=15.0 2023-06-28 20:46:39,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2142486.0, ans=0.05 2023-06-28 20:46:51,884 INFO [train.py:996] (2/4) Epoch 12, batch 21650, loss[loss=0.1909, simple_loss=0.2905, pruned_loss=0.04566, over 21650.00 frames. ], tot_loss[loss=0.1977, simple_loss=0.2743, pruned_loss=0.06056, over 4257175.27 frames. 
], batch size: 263, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:47:03,118 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.132e+02 8.434e+02 1.336e+03 2.286e+03 3.969e+03, threshold=2.673e+03, percent-clipped=30.0 2023-06-28 20:47:28,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2142666.0, ans=0.07 2023-06-28 20:48:21,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2142786.0, ans=0.125 2023-06-28 20:48:26,751 INFO [train.py:996] (2/4) Epoch 12, batch 21700, loss[loss=0.2168, simple_loss=0.2737, pruned_loss=0.07997, over 21251.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2763, pruned_loss=0.0598, over 4260370.75 frames. ], batch size: 471, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:48:37,220 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 20:49:12,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2142966.0, ans=0.2 2023-06-28 20:49:15,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2142966.0, ans=0.125 2023-06-28 20:49:28,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2142966.0, ans=0.1 2023-06-28 20:50:07,596 INFO [train.py:996] (2/4) Epoch 12, batch 21750, loss[loss=0.1754, simple_loss=0.2447, pruned_loss=0.0531, over 21387.00 frames. ], tot_loss[loss=0.1954, simple_loss=0.2716, pruned_loss=0.05959, over 4252015.12 frames. ], batch size: 195, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:50:24,255 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.270e+02 7.010e+02 1.001e+03 1.482e+03 3.293e+03, threshold=2.002e+03, percent-clipped=2.0 2023-06-28 20:50:39,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2143206.0, ans=0.2 2023-06-28 20:50:50,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2143266.0, ans=0.0 2023-06-28 20:51:33,126 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.82 vs. limit=6.0 2023-06-28 20:51:54,921 INFO [train.py:996] (2/4) Epoch 12, batch 21800, loss[loss=0.2332, simple_loss=0.3153, pruned_loss=0.07557, over 21658.00 frames. ], tot_loss[loss=0.1953, simple_loss=0.2695, pruned_loss=0.06049, over 4260433.18 frames. ], batch size: 391, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:51:55,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2143446.0, ans=0.125 2023-06-28 20:52:01,175 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=22.5 2023-06-28 20:52:17,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2143506.0, ans=0.125 2023-06-28 20:53:15,156 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.90 vs. 
limit=15.0 2023-06-28 20:53:36,717 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.00 vs. limit=10.0 2023-06-28 20:53:37,020 INFO [train.py:996] (2/4) Epoch 12, batch 21850, loss[loss=0.1904, simple_loss=0.2709, pruned_loss=0.05492, over 20996.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2753, pruned_loss=0.06041, over 4259695.50 frames. ], batch size: 607, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:53:39,636 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.17 vs. limit=15.0 2023-06-28 20:53:48,652 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.108e+02 8.276e+02 1.227e+03 1.863e+03 4.037e+03, threshold=2.455e+03, percent-clipped=20.0 2023-06-28 20:54:24,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2143866.0, ans=0.2 2023-06-28 20:55:18,295 INFO [train.py:996] (2/4) Epoch 12, batch 21900, loss[loss=0.1724, simple_loss=0.2399, pruned_loss=0.05244, over 21700.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2763, pruned_loss=0.06222, over 4265384.57 frames. ], batch size: 264, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:55:55,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2144166.0, ans=0.0 2023-06-28 20:56:11,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2144166.0, ans=0.2 2023-06-28 20:56:50,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2144286.0, ans=0.125 2023-06-28 20:56:58,123 INFO [train.py:996] (2/4) Epoch 12, batch 21950, loss[loss=0.1856, simple_loss=0.2451, pruned_loss=0.0631, over 21840.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2712, pruned_loss=0.06163, over 4270505.40 frames. ], batch size: 107, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:57:02,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2144346.0, ans=0.1 2023-06-28 20:57:03,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2144346.0, ans=0.2 2023-06-28 20:57:06,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2144346.0, ans=0.04949747468305833 2023-06-28 20:57:09,578 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.558e+02 7.761e+02 1.147e+03 1.869e+03 4.092e+03, threshold=2.294e+03, percent-clipped=9.0 2023-06-28 20:57:21,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2144406.0, ans=0.07 2023-06-28 20:57:44,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2144466.0, ans=0.2 2023-06-28 20:57:54,558 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-28 20:58:40,386 INFO [train.py:996] (2/4) Epoch 12, batch 22000, loss[loss=0.1799, simple_loss=0.2489, pruned_loss=0.05545, over 21622.00 frames. ], tot_loss[loss=0.191, simple_loss=0.2656, pruned_loss=0.05822, over 4250824.47 frames. 
], batch size: 298, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:58:41,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2144646.0, ans=0.0 2023-06-28 20:59:04,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2144706.0, ans=0.125 2023-06-28 20:59:10,632 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=22.5 2023-06-28 20:59:37,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2144766.0, ans=0.95 2023-06-28 20:59:55,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2144826.0, ans=0.2 2023-06-28 20:59:58,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2144826.0, ans=0.04949747468305833 2023-06-28 21:00:23,765 INFO [train.py:996] (2/4) Epoch 12, batch 22050, loss[loss=0.2517, simple_loss=0.3203, pruned_loss=0.09152, over 21256.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2726, pruned_loss=0.06128, over 4243193.60 frames. ], batch size: 159, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 21:00:40,621 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.182e+02 7.125e+02 1.182e+03 1.630e+03 4.961e+03, threshold=2.364e+03, percent-clipped=13.0 2023-06-28 21:00:49,935 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.94 vs. limit=15.0 2023-06-28 21:01:00,264 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.38 vs. limit=12.0 2023-06-28 21:01:11,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2145066.0, ans=0.2 2023-06-28 21:01:25,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2145066.0, ans=0.125 2023-06-28 21:02:01,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2145186.0, ans=0.1 2023-06-28 21:02:06,215 INFO [train.py:996] (2/4) Epoch 12, batch 22100, loss[loss=0.1986, simple_loss=0.2765, pruned_loss=0.06036, over 21960.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2831, pruned_loss=0.06567, over 4245226.46 frames. ], batch size: 316, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 21:03:36,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2145486.0, ans=0.1 2023-06-28 21:03:47,969 INFO [train.py:996] (2/4) Epoch 12, batch 22150, loss[loss=0.233, simple_loss=0.2997, pruned_loss=0.0832, over 21766.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2843, pruned_loss=0.0667, over 4258041.03 frames. 
], batch size: 441, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 21:04:04,053 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.507e+02 8.832e+02 1.298e+03 1.809e+03 3.590e+03, threshold=2.596e+03, percent-clipped=11.0 2023-06-28 21:04:19,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2145606.0, ans=0.0 2023-06-28 21:04:21,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2145606.0, ans=0.0 2023-06-28 21:04:23,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2145606.0, ans=0.2 2023-06-28 21:05:29,507 INFO [train.py:996] (2/4) Epoch 12, batch 22200, loss[loss=0.1887, simple_loss=0.2611, pruned_loss=0.05817, over 21699.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2843, pruned_loss=0.06685, over 4274909.64 frames. ], batch size: 263, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 21:07:00,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2146086.0, ans=0.125 2023-06-28 21:07:02,227 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-06-28 21:07:17,285 INFO [train.py:996] (2/4) Epoch 12, batch 22250, loss[loss=0.2212, simple_loss=0.3033, pruned_loss=0.06955, over 21668.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.291, pruned_loss=0.06842, over 4273290.05 frames. ], batch size: 230, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:07:29,286 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.277e+02 8.206e+02 1.186e+03 1.604e+03 3.301e+03, threshold=2.372e+03, percent-clipped=3.0 2023-06-28 21:07:29,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2146146.0, ans=0.125 2023-06-28 21:08:23,502 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=12.0 2023-06-28 21:08:54,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2146386.0, ans=0.125 2023-06-28 21:08:57,735 INFO [train.py:996] (2/4) Epoch 12, batch 22300, loss[loss=0.2538, simple_loss=0.307, pruned_loss=0.1003, over 21763.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2942, pruned_loss=0.06987, over 4274377.78 frames. ], batch size: 508, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:09:46,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2146566.0, ans=0.1 2023-06-28 21:09:51,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2146566.0, ans=0.1 2023-06-28 21:09:59,083 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.28 vs. 
limit=12.0 2023-06-28 21:10:04,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2146626.0, ans=0.1 2023-06-28 21:10:21,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2146686.0, ans=0.0 2023-06-28 21:10:38,575 INFO [train.py:996] (2/4) Epoch 12, batch 22350, loss[loss=0.1988, simple_loss=0.2704, pruned_loss=0.0636, over 21452.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2924, pruned_loss=0.07054, over 4282420.90 frames. ], batch size: 177, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:10:39,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2146746.0, ans=0.125 2023-06-28 21:10:50,161 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.715e+02 7.662e+02 1.007e+03 1.656e+03 3.932e+03, threshold=2.013e+03, percent-clipped=14.0 2023-06-28 21:10:50,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2146746.0, ans=0.125 2023-06-28 21:11:32,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2146866.0, ans=0.2 2023-06-28 21:11:32,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2146866.0, ans=0.1 2023-06-28 21:12:20,273 INFO [train.py:996] (2/4) Epoch 12, batch 22400, loss[loss=0.2043, simple_loss=0.2888, pruned_loss=0.05991, over 21764.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2895, pruned_loss=0.06681, over 4263452.05 frames. ], batch size: 351, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:13:27,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2147226.0, ans=0.125 2023-06-28 21:13:51,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2147286.0, ans=0.125 2023-06-28 21:14:04,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.whiten.whitening_limit, batch_count=2147346.0, ans=12.0 2023-06-28 21:14:05,224 INFO [train.py:996] (2/4) Epoch 12, batch 22450, loss[loss=0.2278, simple_loss=0.2668, pruned_loss=0.09445, over 21398.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2834, pruned_loss=0.06619, over 4257049.66 frames. ], batch size: 509, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:14:13,140 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.62 vs. limit=15.0 2023-06-28 21:14:18,831 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.767e+02 6.974e+02 9.708e+02 1.486e+03 4.519e+03, threshold=1.942e+03, percent-clipped=14.0 2023-06-28 21:15:48,357 INFO [train.py:996] (2/4) Epoch 12, batch 22500, loss[loss=0.1636, simple_loss=0.2162, pruned_loss=0.0555, over 20714.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2775, pruned_loss=0.06503, over 4268766.46 frames. 
], batch size: 607, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:16:31,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2147766.0, ans=0.0 2023-06-28 21:17:11,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2147886.0, ans=0.1 2023-06-28 21:17:22,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2147886.0, ans=0.125 2023-06-28 21:17:31,325 INFO [train.py:996] (2/4) Epoch 12, batch 22550, loss[loss=0.1941, simple_loss=0.2663, pruned_loss=0.06096, over 21549.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2829, pruned_loss=0.06583, over 4273558.45 frames. ], batch size: 548, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:17:43,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2147946.0, ans=0.125 2023-06-28 21:17:49,758 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.593e+02 9.385e+02 1.394e+03 1.973e+03 3.224e+03, threshold=2.788e+03, percent-clipped=25.0 2023-06-28 21:18:41,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2148126.0, ans=0.1 2023-06-28 21:18:53,732 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.89 vs. limit=15.0 2023-06-28 21:19:20,504 INFO [train.py:996] (2/4) Epoch 12, batch 22600, loss[loss=0.1892, simple_loss=0.2624, pruned_loss=0.05799, over 21624.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2841, pruned_loss=0.06606, over 4274175.56 frames. ], batch size: 230, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:19:28,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2148246.0, ans=0.125 2023-06-28 21:19:54,732 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.78 vs. limit=10.0 2023-06-28 21:20:31,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2148426.0, ans=0.0 2023-06-28 21:20:44,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2148486.0, ans=0.0 2023-06-28 21:21:00,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2148546.0, ans=0.2 2023-06-28 21:21:01,900 INFO [train.py:996] (2/4) Epoch 12, batch 22650, loss[loss=0.1892, simple_loss=0.2561, pruned_loss=0.06119, over 21762.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2822, pruned_loss=0.06613, over 4273330.15 frames. 
], batch size: 112, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:21:07,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2148546.0, ans=0.0 2023-06-28 21:21:10,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2148546.0, ans=0.125 2023-06-28 21:21:14,882 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.226e+02 9.650e+02 1.395e+03 1.973e+03 4.081e+03, threshold=2.791e+03, percent-clipped=9.0 2023-06-28 21:21:18,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2148606.0, ans=0.125 2023-06-28 21:22:14,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2148726.0, ans=0.1 2023-06-28 21:22:18,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2148726.0, ans=0.125 2023-06-28 21:22:41,742 INFO [train.py:996] (2/4) Epoch 12, batch 22700, loss[loss=0.2389, simple_loss=0.2757, pruned_loss=0.101, over 21400.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2761, pruned_loss=0.06553, over 4270162.12 frames. ], batch size: 509, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:22:46,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2148846.0, ans=0.125 2023-06-28 21:22:59,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2148906.0, ans=0.2 2023-06-28 21:24:24,383 INFO [train.py:996] (2/4) Epoch 12, batch 22750, loss[loss=0.2275, simple_loss=0.3001, pruned_loss=0.0775, over 21776.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2768, pruned_loss=0.06668, over 4265458.76 frames. ], batch size: 332, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:24:36,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2149146.0, ans=0.5 2023-06-28 21:24:37,891 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.797e+02 7.718e+02 1.201e+03 1.681e+03 3.626e+03, threshold=2.402e+03, percent-clipped=4.0 2023-06-28 21:24:44,127 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.12 vs. limit=10.0 2023-06-28 21:25:17,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2149266.0, ans=0.0 2023-06-28 21:25:29,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2149326.0, ans=0.035 2023-06-28 21:26:05,755 INFO [train.py:996] (2/4) Epoch 12, batch 22800, loss[loss=0.1932, simple_loss=0.2677, pruned_loss=0.05934, over 21421.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2814, pruned_loss=0.06816, over 4268918.06 frames. ], batch size: 211, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 21:26:24,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2149446.0, ans=0.0 2023-06-28 21:26:25,139 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.31 vs. 
limit=15.0 2023-06-28 21:26:32,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=2149506.0, ans=0.025 2023-06-28 21:26:46,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2149566.0, ans=0.125 2023-06-28 21:26:58,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2149566.0, ans=0.0 2023-06-28 21:27:19,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2149626.0, ans=0.125 2023-06-28 21:27:45,951 INFO [train.py:996] (2/4) Epoch 12, batch 22850, loss[loss=0.1835, simple_loss=0.2523, pruned_loss=0.05729, over 21511.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2785, pruned_loss=0.0677, over 4277766.16 frames. ], batch size: 230, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:28:00,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2149746.0, ans=0.125 2023-06-28 21:28:01,320 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.804e+02 7.642e+02 1.050e+03 1.882e+03 3.484e+03, threshold=2.099e+03, percent-clipped=13.0 2023-06-28 21:28:04,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2149806.0, ans=0.1 2023-06-28 21:28:24,480 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=15.0 2023-06-28 21:28:53,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2149926.0, ans=0.1 2023-06-28 21:29:07,413 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.82 vs. limit=6.0 2023-06-28 21:29:30,174 INFO [train.py:996] (2/4) Epoch 12, batch 22900, loss[loss=0.2215, simple_loss=0.3224, pruned_loss=0.06029, over 21820.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.282, pruned_loss=0.06722, over 4274006.88 frames. ], batch size: 371, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:29:52,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2150106.0, ans=0.0 2023-06-28 21:30:06,250 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-28 21:30:41,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2150226.0, ans=0.125 2023-06-28 21:30:56,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2150286.0, ans=0.125 2023-06-28 21:30:58,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2150286.0, ans=0.125 2023-06-28 21:31:00,866 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.52 vs. 
limit=12.0 2023-06-28 21:31:09,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2150286.0, ans=0.5 2023-06-28 21:31:09,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2150286.0, ans=0.125 2023-06-28 21:31:19,857 INFO [train.py:996] (2/4) Epoch 12, batch 22950, loss[loss=0.2161, simple_loss=0.3217, pruned_loss=0.05527, over 21298.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2943, pruned_loss=0.06623, over 4273653.74 frames. ], batch size: 176, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:31:22,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2150346.0, ans=0.125 2023-06-28 21:31:39,645 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.589e+02 9.756e+02 1.509e+03 2.315e+03 4.900e+03, threshold=3.017e+03, percent-clipped=30.0 2023-06-28 21:31:48,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2150406.0, ans=0.125 2023-06-28 21:32:46,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2150586.0, ans=0.125 2023-06-28 21:33:02,911 INFO [train.py:996] (2/4) Epoch 12, batch 23000, loss[loss=0.2004, simple_loss=0.2758, pruned_loss=0.06244, over 21790.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2948, pruned_loss=0.06442, over 4269503.77 frames. ], batch size: 247, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:33:35,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2150706.0, ans=0.0 2023-06-28 21:33:40,869 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.37 vs. limit=22.5 2023-06-28 21:34:06,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2150826.0, ans=0.0 2023-06-28 21:34:21,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2150826.0, ans=0.125 2023-06-28 21:34:32,157 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=12.0 2023-06-28 21:34:51,534 INFO [train.py:996] (2/4) Epoch 12, batch 23050, loss[loss=0.215, simple_loss=0.2895, pruned_loss=0.0703, over 20669.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2946, pruned_loss=0.06589, over 4278279.41 frames. 
], batch size: 608, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:34:52,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2150946.0, ans=0.025 2023-06-28 21:35:10,952 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.719e+02 9.558e+02 1.419e+03 1.890e+03 3.669e+03, threshold=2.838e+03, percent-clipped=6.0 2023-06-28 21:35:11,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2151006.0, ans=0.1 2023-06-28 21:35:14,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2151006.0, ans=0.0 2023-06-28 21:36:17,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2151186.0, ans=0.125 2023-06-28 21:36:20,610 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 21:36:34,601 INFO [train.py:996] (2/4) Epoch 12, batch 23100, loss[loss=0.178, simple_loss=0.2469, pruned_loss=0.05451, over 21782.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2907, pruned_loss=0.06641, over 4278571.81 frames. ], batch size: 317, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:36:39,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=2151246.0, ans=22.5 2023-06-28 21:37:01,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2151306.0, ans=0.2 2023-06-28 21:37:02,218 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=12.0 2023-06-28 21:37:11,255 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 21:37:32,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2151426.0, ans=0.0 2023-06-28 21:37:44,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2151426.0, ans=0.125 2023-06-28 21:38:13,712 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 21:38:16,279 INFO [train.py:996] (2/4) Epoch 12, batch 23150, loss[loss=0.1911, simple_loss=0.2591, pruned_loss=0.06157, over 21567.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2845, pruned_loss=0.06588, over 4273105.65 frames. ], batch size: 548, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:38:30,884 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.983e+02 7.198e+02 1.006e+03 1.345e+03 2.860e+03, threshold=2.012e+03, percent-clipped=2.0 2023-06-28 21:39:15,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2151726.0, ans=0.0 2023-06-28 21:39:57,506 INFO [train.py:996] (2/4) Epoch 12, batch 23200, loss[loss=0.2529, simple_loss=0.3053, pruned_loss=0.1002, over 21790.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2839, pruned_loss=0.06662, over 4283828.82 frames. 
], batch size: 508, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 21:40:14,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2151906.0, ans=0.125 2023-06-28 21:41:38,932 INFO [train.py:996] (2/4) Epoch 12, batch 23250, loss[loss=0.188, simple_loss=0.2446, pruned_loss=0.06569, over 21238.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2833, pruned_loss=0.06719, over 4288048.37 frames. ], batch size: 608, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 21:41:42,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2152146.0, ans=0.1 2023-06-28 21:41:52,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2152146.0, ans=0.0 2023-06-28 21:41:58,599 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.316e+02 9.370e+02 1.450e+03 2.114e+03 3.490e+03, threshold=2.900e+03, percent-clipped=30.0 2023-06-28 21:42:05,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2152206.0, ans=0.125 2023-06-28 21:43:22,229 INFO [train.py:996] (2/4) Epoch 12, batch 23300, loss[loss=0.272, simple_loss=0.3803, pruned_loss=0.0819, over 21847.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2898, pruned_loss=0.06818, over 4283588.93 frames. ], batch size: 371, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 21:43:57,855 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.43 vs. limit=10.0 2023-06-28 21:43:59,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=2152506.0, ans=15.0 2023-06-28 21:44:03,921 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 21:44:12,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2152566.0, ans=0.125 2023-06-28 21:45:09,873 INFO [train.py:996] (2/4) Epoch 12, batch 23350, loss[loss=0.1754, simple_loss=0.2682, pruned_loss=0.04127, over 21615.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2935, pruned_loss=0.06765, over 4272296.73 frames. ], batch size: 389, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 21:45:14,302 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.07 vs. 
limit=10.0 2023-06-28 21:45:33,278 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.157e+02 1.010e+03 1.481e+03 2.093e+03 4.806e+03, threshold=2.962e+03, percent-clipped=5.0 2023-06-28 21:45:52,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2152866.0, ans=0.125 2023-06-28 21:45:56,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2152866.0, ans=0.1 2023-06-28 21:45:58,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2152866.0, ans=0.04949747468305833 2023-06-28 21:46:11,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2152926.0, ans=0.2 2023-06-28 21:46:51,355 INFO [train.py:996] (2/4) Epoch 12, batch 23400, loss[loss=0.1713, simple_loss=0.275, pruned_loss=0.03381, over 20747.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2891, pruned_loss=0.06522, over 4274208.41 frames. ], batch size: 607, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 21:46:56,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2153046.0, ans=0.2 2023-06-28 21:48:32,839 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.34 vs. limit=15.0 2023-06-28 21:48:38,219 INFO [train.py:996] (2/4) Epoch 12, batch 23450, loss[loss=0.2217, simple_loss=0.2931, pruned_loss=0.07513, over 21337.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2894, pruned_loss=0.06694, over 4275780.73 frames. ], batch size: 176, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 21:48:56,418 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.004e+02 7.180e+02 1.083e+03 1.740e+03 4.594e+03, threshold=2.165e+03, percent-clipped=4.0 2023-06-28 21:49:05,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2153406.0, ans=0.1 2023-06-28 21:49:39,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2153526.0, ans=0.125 2023-06-28 21:49:57,956 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-28 21:50:03,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2153586.0, ans=0.125 2023-06-28 21:50:19,189 INFO [train.py:996] (2/4) Epoch 12, batch 23500, loss[loss=0.2005, simple_loss=0.2735, pruned_loss=0.06377, over 21937.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2892, pruned_loss=0.06804, over 4269482.31 frames. 
], batch size: 316, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 21:50:32,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2153646.0, ans=0.125 2023-06-28 21:51:12,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2153766.0, ans=0.0 2023-06-28 21:51:27,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2153826.0, ans=0.1 2023-06-28 21:51:31,305 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.51 vs. limit=22.5 2023-06-28 21:51:47,382 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-06-28 21:51:56,099 INFO [train.py:996] (2/4) Epoch 12, batch 23550, loss[loss=0.1723, simple_loss=0.2381, pruned_loss=0.05325, over 21643.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2852, pruned_loss=0.06741, over 4260968.60 frames. ], batch size: 282, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 21:52:18,943 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.122e+02 7.386e+02 1.223e+03 1.985e+03 5.110e+03, threshold=2.446e+03, percent-clipped=21.0 2023-06-28 21:52:26,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=2154006.0, ans=0.05 2023-06-28 21:53:06,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2154126.0, ans=0.125 2023-06-28 21:53:43,339 INFO [train.py:996] (2/4) Epoch 12, batch 23600, loss[loss=0.2074, simple_loss=0.277, pruned_loss=0.06892, over 21823.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2858, pruned_loss=0.06774, over 4261516.64 frames. ], batch size: 98, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:54:21,604 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.55 vs. limit=15.0 2023-06-28 21:55:04,061 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=12.0 2023-06-28 21:55:23,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2154486.0, ans=0.0 2023-06-28 21:55:26,549 INFO [train.py:996] (2/4) Epoch 12, batch 23650, loss[loss=0.2429, simple_loss=0.3146, pruned_loss=0.08566, over 21297.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2856, pruned_loss=0.06568, over 4269644.17 frames. ], batch size: 176, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:55:50,245 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.631e+02 9.498e+02 1.627e+03 2.545e+03 5.743e+03, threshold=3.254e+03, percent-clipped=28.0 2023-06-28 21:56:06,271 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.07 vs. limit=15.0 2023-06-28 21:56:17,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2154666.0, ans=0.125 2023-06-28 21:57:10,360 INFO [train.py:996] (2/4) Epoch 12, batch 23700, loss[loss=0.2307, simple_loss=0.3076, pruned_loss=0.07689, over 21425.00 frames. 
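Note on the scaling.py "ScheduledFloat: name=..., batch_count=..., ans=..." records: they log hyperparameters (dropout probabilities, skip rates, balancer probs, bypass scales) whose current value ("ans") is a function of how many batches have been trained, identified by batch_count. A minimal piecewise-linear schedule keyed on batch count is sketched below; the class and the breakpoints are made-up illustrations, not icefall's actual scaling.ScheduledFloat.

class PiecewiseLinearSchedule:
    """Illustrative stand-in for a batch-count-dependent float hyperparameter."""

    def __init__(self, *points):
        # (batch_count, value) breakpoints, e.g. (0, 0.3), (20000, 0.1), (50000, 0.0).
        self.points = sorted(points)

    def __call__(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                frac = (batch_count - x0) / (x1 - x0)
                return y0 + frac * (y1 - y0)

# Example: a skip-rate that decays from 0.3 to 0.0 over the first 50k batches.
conv_skip_rate = PiecewiseLinearSchedule((0.0, 0.3), (20000.0, 0.1), (50000.0, 0.0))
print(conv_skip_rate(2152146.0))  # far past the last breakpoint -> 0.0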
], tot_loss[loss=0.2104, simple_loss=0.2884, pruned_loss=0.06625, over 4265392.64 frames. ], batch size: 471, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:58:39,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2155086.0, ans=0.125 2023-06-28 21:58:41,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2155086.0, ans=0.125 2023-06-28 21:58:44,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2155086.0, ans=0.0 2023-06-28 21:58:58,953 INFO [train.py:996] (2/4) Epoch 12, batch 23750, loss[loss=0.1755, simple_loss=0.2652, pruned_loss=0.0429, over 21420.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2917, pruned_loss=0.0673, over 4272469.38 frames. ], batch size: 211, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:59:21,760 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.921e+02 7.434e+02 9.463e+02 1.338e+03 4.159e+03, threshold=1.893e+03, percent-clipped=3.0 2023-06-28 21:59:32,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2155206.0, ans=0.125 2023-06-28 21:59:48,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2155266.0, ans=0.125 2023-06-28 21:59:57,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2155266.0, ans=0.125 2023-06-28 22:00:47,763 INFO [train.py:996] (2/4) Epoch 12, batch 23800, loss[loss=0.2885, simple_loss=0.3726, pruned_loss=0.1022, over 21585.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2908, pruned_loss=0.06573, over 4273120.47 frames. ], batch size: 441, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 22:01:11,416 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.70 vs. limit=15.0 2023-06-28 22:02:07,994 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=15.0 2023-06-28 22:02:24,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2155686.0, ans=0.125 2023-06-28 22:02:36,535 INFO [train.py:996] (2/4) Epoch 12, batch 23850, loss[loss=0.2184, simple_loss=0.3009, pruned_loss=0.06799, over 21811.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2995, pruned_loss=0.06785, over 4275745.52 frames. 
], batch size: 282, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 22:02:42,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2155746.0, ans=0.1 2023-06-28 22:03:01,487 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.378e+02 9.558e+02 1.642e+03 2.659e+03 5.260e+03, threshold=3.284e+03, percent-clipped=38.0 2023-06-28 22:03:15,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2155866.0, ans=0.0 2023-06-28 22:03:27,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2155866.0, ans=0.125 2023-06-28 22:04:19,076 INFO [train.py:996] (2/4) Epoch 12, batch 23900, loss[loss=0.1955, simple_loss=0.2787, pruned_loss=0.05621, over 21720.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3054, pruned_loss=0.06955, over 4279011.14 frames. ], batch size: 282, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 22:05:56,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2156286.0, ans=0.125 2023-06-28 22:06:00,813 INFO [train.py:996] (2/4) Epoch 12, batch 23950, loss[loss=0.1914, simple_loss=0.2551, pruned_loss=0.0639, over 21277.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2976, pruned_loss=0.06853, over 4278878.70 frames. ], batch size: 177, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 22:06:25,837 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.975e+02 6.890e+02 9.042e+02 1.238e+03 2.308e+03, threshold=1.808e+03, percent-clipped=0.0 2023-06-28 22:06:26,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2156406.0, ans=0.125 2023-06-28 22:07:48,526 INFO [train.py:996] (2/4) Epoch 12, batch 24000, loss[loss=0.2238, simple_loss=0.2983, pruned_loss=0.07468, over 21977.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2988, pruned_loss=0.07132, over 4280247.94 frames. ], batch size: 317, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:07:48,526 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-28 22:07:57,297 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.2.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.0938, 2.7489, 4.3150, 3.2112], device='cuda:2') 2023-06-28 22:08:05,137 INFO [train.py:1028] (2/4) Epoch 12, validation: loss=0.264, simple_loss=0.3553, pruned_loss=0.08634, over 1796401.00 frames. 2023-06-28 22:08:05,138 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23793MB 2023-06-28 22:09:31,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2156886.0, ans=0.1 2023-06-28 22:09:31,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2156886.0, ans=0.0 2023-06-28 22:09:32,301 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-28 22:09:49,057 INFO [train.py:996] (2/4) Epoch 12, batch 24050, loss[loss=0.1866, simple_loss=0.2747, pruned_loss=0.04923, over 21492.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3015, pruned_loss=0.07215, over 4284597.84 frames. 
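Note on the loss=/simple_loss=/pruned_loss= triples logged by train.py: they are consistent with the pruned-transducer objective loss = simple_loss_scale * simple_loss + pruned_loss with simple_loss_scale = 0.5; for the batch 24000 validation entry above, 0.5 * 0.3553 + 0.08634 = 0.264, and the per-batch entries match the same relation. The arithmetic is shown below as a sketch; any warm-up-dependent reweighting train.py may apply early in training is ignored here, and by this point in the run it no longer changes the result.

def combine_transducer_losses(simple_loss, pruned_loss, simple_loss_scale=0.5):
    # Matches the logged triples, e.g. the validation entry above:
    # 0.5 * 0.3553 + 0.08634 ~= 0.264.
    return simple_loss_scale * simple_loss + pruned_loss

assert abs(combine_transducer_losses(0.3553, 0.08634) - 0.264) < 1e-3
assert abs(combine_transducer_losses(0.2983, 0.07468) - 0.2238) < 1e-3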
], batch size: 211, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:09:59,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2156946.0, ans=0.125 2023-06-28 22:10:13,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2157006.0, ans=0.125 2023-06-28 22:10:14,179 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.391e+02 8.286e+02 1.353e+03 2.052e+03 4.335e+03, threshold=2.707e+03, percent-clipped=33.0 2023-06-28 22:10:22,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2157006.0, ans=0.125 2023-06-28 22:10:26,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2157006.0, ans=0.125 2023-06-28 22:10:52,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2157126.0, ans=0.125 2023-06-28 22:11:09,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2157126.0, ans=0.125 2023-06-28 22:11:23,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2157186.0, ans=0.07 2023-06-28 22:11:23,846 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 22:11:31,533 INFO [train.py:996] (2/4) Epoch 12, batch 24100, loss[loss=0.2364, simple_loss=0.3163, pruned_loss=0.07827, over 21423.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3003, pruned_loss=0.07058, over 4283993.92 frames. ], batch size: 211, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:12:04,642 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.54 vs. limit=12.0 2023-06-28 22:12:07,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2157306.0, ans=10.0 2023-06-28 22:13:12,810 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.63 vs. limit=22.5 2023-06-28 22:13:13,306 INFO [train.py:996] (2/4) Epoch 12, batch 24150, loss[loss=0.2273, simple_loss=0.2935, pruned_loss=0.0805, over 21479.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3004, pruned_loss=0.07214, over 4288769.38 frames. ], batch size: 548, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:13:17,805 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.31 vs. 
limit=15.0 2023-06-28 22:13:43,013 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.450e+02 8.470e+02 1.133e+03 1.588e+03 3.416e+03, threshold=2.267e+03, percent-clipped=5.0 2023-06-28 22:13:47,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2157606.0, ans=0.125 2023-06-28 22:13:55,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2157606.0, ans=0.0 2023-06-28 22:14:28,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2157726.0, ans=0.125 2023-06-28 22:14:56,853 INFO [train.py:996] (2/4) Epoch 12, batch 24200, loss[loss=0.2132, simple_loss=0.3279, pruned_loss=0.04926, over 19836.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3023, pruned_loss=0.07265, over 4291269.10 frames. ], batch size: 702, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:15:11,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2157846.0, ans=0.125 2023-06-28 22:16:02,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2158026.0, ans=0.1 2023-06-28 22:16:02,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2158026.0, ans=0.125 2023-06-28 22:16:08,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2158026.0, ans=0.2 2023-06-28 22:16:26,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2158086.0, ans=0.125 2023-06-28 22:16:47,974 INFO [train.py:996] (2/4) Epoch 12, batch 24250, loss[loss=0.175, simple_loss=0.2786, pruned_loss=0.03575, over 21669.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2993, pruned_loss=0.06786, over 4283744.99 frames. ], batch size: 414, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:17:17,819 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.733e+02 8.184e+02 1.120e+03 1.541e+03 3.593e+03, threshold=2.240e+03, percent-clipped=10.0 2023-06-28 22:18:23,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2158386.0, ans=0.2 2023-06-28 22:18:31,293 INFO [train.py:996] (2/4) Epoch 12, batch 24300, loss[loss=0.0969, simple_loss=0.1598, pruned_loss=0.017, over 16318.00 frames. ], tot_loss[loss=0.208, simple_loss=0.292, pruned_loss=0.06196, over 4273601.89 frames. ], batch size: 60, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:18:54,386 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 22:18:59,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2158506.0, ans=0.0 2023-06-28 22:19:14,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=2158566.0, ans=0.025 2023-06-28 22:20:13,772 INFO [train.py:996] (2/4) Epoch 12, batch 24350, loss[loss=0.199, simple_loss=0.2756, pruned_loss=0.06117, over 21671.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2884, pruned_loss=0.06146, over 4282488.28 frames. 
], batch size: 263, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:20:38,906 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.237e+02 7.403e+02 1.076e+03 1.597e+03 3.002e+03, threshold=2.153e+03, percent-clipped=3.0 2023-06-28 22:20:47,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2158806.0, ans=0.2 2023-06-28 22:21:08,731 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.25 vs. limit=15.0 2023-06-28 22:21:52,117 INFO [train.py:996] (2/4) Epoch 12, batch 24400, loss[loss=0.2087, simple_loss=0.3046, pruned_loss=0.0564, over 16888.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2924, pruned_loss=0.06407, over 4282232.96 frames. ], batch size: 60, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 22:21:52,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2159046.0, ans=0.125 2023-06-28 22:22:02,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2159046.0, ans=0.125 2023-06-28 22:22:15,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2159106.0, ans=0.125 2023-06-28 22:22:49,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2159166.0, ans=0.1 2023-06-28 22:23:29,782 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-06-28 22:23:39,917 INFO [train.py:996] (2/4) Epoch 12, batch 24450, loss[loss=0.2555, simple_loss=0.3582, pruned_loss=0.07643, over 21245.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2949, pruned_loss=0.06564, over 4283893.98 frames. ], batch size: 549, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:24:01,304 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.221e+02 9.735e+02 1.433e+03 2.433e+03 5.313e+03, threshold=2.865e+03, percent-clipped=29.0 2023-06-28 22:24:23,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2159466.0, ans=0.125 2023-06-28 22:24:34,164 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.80 vs. limit=15.0 2023-06-28 22:25:01,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2159586.0, ans=0.1 2023-06-28 22:25:22,565 INFO [train.py:996] (2/4) Epoch 12, batch 24500, loss[loss=0.1987, simple_loss=0.2531, pruned_loss=0.07218, over 20250.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2956, pruned_loss=0.06641, over 4283830.94 frames. ], batch size: 703, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:25:28,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2159646.0, ans=0.2 2023-06-28 22:25:48,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2159706.0, ans=10.0 2023-06-28 22:25:54,158 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.47 vs. 
limit=22.5 2023-06-28 22:25:58,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2159766.0, ans=0.125 2023-06-28 22:26:39,749 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.48 vs. limit=15.0 2023-06-28 22:26:57,632 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=12.0 2023-06-28 22:27:03,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2159946.0, ans=0.125 2023-06-28 22:27:04,767 INFO [train.py:996] (2/4) Epoch 12, batch 24550, loss[loss=0.2603, simple_loss=0.3358, pruned_loss=0.09246, over 21275.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2975, pruned_loss=0.06768, over 4283598.23 frames. ], batch size: 176, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:27:28,296 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.692e+02 8.632e+02 1.069e+03 1.677e+03 3.577e+03, threshold=2.139e+03, percent-clipped=6.0 2023-06-28 22:27:44,600 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.79 vs. limit=15.0 2023-06-28 22:28:05,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2160066.0, ans=0.0 2023-06-28 22:28:24,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2160126.0, ans=0.07 2023-06-28 22:28:28,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2160126.0, ans=0.125 2023-06-28 22:28:48,546 INFO [train.py:996] (2/4) Epoch 12, batch 24600, loss[loss=0.2299, simple_loss=0.2847, pruned_loss=0.08748, over 21261.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2945, pruned_loss=0.06799, over 4273770.99 frames. ], batch size: 471, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:28:50,021 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.57 vs. limit=15.0 2023-06-28 22:28:56,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2160246.0, ans=0.0 2023-06-28 22:29:09,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2160306.0, ans=0.125 2023-06-28 22:29:49,316 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.86 vs. limit=12.0 2023-06-28 22:29:58,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2160426.0, ans=0.125 2023-06-28 22:30:10,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2160426.0, ans=0.125 2023-06-28 22:30:32,041 INFO [train.py:996] (2/4) Epoch 12, batch 24650, loss[loss=0.1933, simple_loss=0.2676, pruned_loss=0.05946, over 21591.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2884, pruned_loss=0.06781, over 4266832.36 frames. 
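Note on the scaling.py "Whitening: name=..., metric=... vs. limit=..." records: each compares a per-module statistic of the activations against a limit, and the module's name suggests it nudges activations toward a whiter (less correlated) channel covariance when the metric exceeds that limit. The statistic sketched below is one plausible way to measure this: it equals 1.0 for a perfectly white covariance and approaches the channel count when the energy concentrates in a few directions. The formula is an assumption for illustration only, not necessarily the metric computed in icefall's scaling.py.

import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    """Hypothetical whiteness statistic for activations x of shape (N, C)."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]        # (C, C) channel covariance
    eigs = torch.linalg.eigvalsh(cov)     # real eigenvalues of the covariance
    c = x.shape[1]
    # 1.0 when all eigenvalues are equal (white); up to C when one direction dominates.
    return c * (eigs ** 2).sum() / (eigs.sum() ** 2 + 1e-20)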
], batch size: 247, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:30:53,468 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.507e+02 9.210e+02 1.420e+03 2.040e+03 4.110e+03, threshold=2.841e+03, percent-clipped=23.0 2023-06-28 22:31:08,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=2160606.0, ans=15.0 2023-06-28 22:31:25,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2160666.0, ans=0.125 2023-06-28 22:31:25,804 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=15.0 2023-06-28 22:31:46,307 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 22:32:04,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2160786.0, ans=0.0 2023-06-28 22:32:13,590 INFO [train.py:996] (2/4) Epoch 12, batch 24700, loss[loss=0.1778, simple_loss=0.2458, pruned_loss=0.05491, over 21595.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2853, pruned_loss=0.06702, over 4269857.94 frames. ], batch size: 247, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:33:50,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2161086.0, ans=0.2 2023-06-28 22:33:54,744 INFO [train.py:996] (2/4) Epoch 12, batch 24750, loss[loss=0.1912, simple_loss=0.2585, pruned_loss=0.0619, over 21618.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.279, pruned_loss=0.0642, over 4264454.79 frames. ], batch size: 415, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:33:56,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2161146.0, ans=0.0 2023-06-28 22:34:01,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2161146.0, ans=0.0 2023-06-28 22:34:15,910 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.049e+02 6.504e+02 9.325e+02 1.249e+03 2.794e+03, threshold=1.865e+03, percent-clipped=0.0 2023-06-28 22:34:32,620 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 22:35:31,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2161386.0, ans=0.0 2023-06-28 22:35:34,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2161446.0, ans=0.0 2023-06-28 22:35:35,294 INFO [train.py:996] (2/4) Epoch 12, batch 24800, loss[loss=0.1663, simple_loss=0.2238, pruned_loss=0.05443, over 20735.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2735, pruned_loss=0.06328, over 4264600.62 frames. 
], batch size: 609, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 22:36:30,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2161566.0, ans=0.125 2023-06-28 22:36:33,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2161566.0, ans=0.125 2023-06-28 22:36:41,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2161626.0, ans=0.0 2023-06-28 22:36:43,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2161626.0, ans=0.1 2023-06-28 22:37:03,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2161686.0, ans=0.0 2023-06-28 22:37:18,786 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0 2023-06-28 22:37:19,242 INFO [train.py:996] (2/4) Epoch 12, batch 24850, loss[loss=0.1824, simple_loss=0.2583, pruned_loss=0.05327, over 21646.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.275, pruned_loss=0.065, over 4275843.86 frames. ], batch size: 263, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:37:42,811 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.956e+02 8.268e+02 1.225e+03 1.737e+03 3.601e+03, threshold=2.449e+03, percent-clipped=20.0 2023-06-28 22:39:01,952 INFO [train.py:996] (2/4) Epoch 12, batch 24900, loss[loss=0.2708, simple_loss=0.3538, pruned_loss=0.09388, over 21844.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2786, pruned_loss=0.06583, over 4278593.59 frames. ], batch size: 118, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:39:29,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2162106.0, ans=0.1 2023-06-28 22:39:44,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2162166.0, ans=0.125 2023-06-28 22:40:17,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2162226.0, ans=0.2 2023-06-28 22:40:35,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2162286.0, ans=0.125 2023-06-28 22:40:46,449 INFO [train.py:996] (2/4) Epoch 12, batch 24950, loss[loss=0.2541, simple_loss=0.3311, pruned_loss=0.0885, over 21447.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2875, pruned_loss=0.07013, over 4277235.87 frames. 
], batch size: 131, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:41:14,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2162406.0, ans=0.0 2023-06-28 22:41:17,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2162406.0, ans=0.07 2023-06-28 22:41:20,556 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.276e+02 8.663e+02 1.354e+03 1.983e+03 3.739e+03, threshold=2.709e+03, percent-clipped=10.0 2023-06-28 22:41:36,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2162466.0, ans=0.2 2023-06-28 22:41:43,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2162466.0, ans=0.1 2023-06-28 22:41:48,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2162466.0, ans=0.0 2023-06-28 22:42:10,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2162586.0, ans=0.2 2023-06-28 22:42:31,486 INFO [train.py:996] (2/4) Epoch 12, batch 25000, loss[loss=0.2229, simple_loss=0.2912, pruned_loss=0.0773, over 21647.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2928, pruned_loss=0.07121, over 4279717.06 frames. ], batch size: 441, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:44:03,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2162886.0, ans=0.035 2023-06-28 22:44:12,541 INFO [train.py:996] (2/4) Epoch 12, batch 25050, loss[loss=0.1875, simple_loss=0.2528, pruned_loss=0.06113, over 21765.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2858, pruned_loss=0.06985, over 4278283.73 frames. ], batch size: 317, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:44:49,733 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.916e+02 6.443e+02 9.220e+02 1.309e+03 4.556e+03, threshold=1.844e+03, percent-clipped=4.0 2023-06-28 22:45:24,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2163126.0, ans=0.1 2023-06-28 22:45:25,519 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=15.0 2023-06-28 22:45:28,994 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. limit=6.0 2023-06-28 22:45:54,126 INFO [train.py:996] (2/4) Epoch 12, batch 25100, loss[loss=0.2009, simple_loss=0.2663, pruned_loss=0.06779, over 21743.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2807, pruned_loss=0.06865, over 4264373.77 frames. ], batch size: 124, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:46:39,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2163306.0, ans=0.125 2023-06-28 22:46:45,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2163366.0, ans=0.0 2023-06-28 22:47:20,195 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.28 vs. 
limit=15.0 2023-06-28 22:47:30,134 INFO [train.py:996] (2/4) Epoch 12, batch 25150, loss[loss=0.1976, simple_loss=0.2989, pruned_loss=0.0481, over 21781.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2854, pruned_loss=0.06618, over 4271891.16 frames. ], batch size: 282, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:47:45,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2163546.0, ans=0.0 2023-06-28 22:47:48,185 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.15 vs. limit=15.0 2023-06-28 22:48:07,816 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.073e+02 7.241e+02 9.101e+02 1.469e+03 3.331e+03, threshold=1.820e+03, percent-clipped=11.0 2023-06-28 22:48:10,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2163606.0, ans=0.5 2023-06-28 22:48:23,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2163666.0, ans=0.1 2023-06-28 22:48:27,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=2163666.0, ans=22.5 2023-06-28 22:48:45,637 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.01 vs. limit=22.5 2023-06-28 22:49:07,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2163786.0, ans=0.0 2023-06-28 22:49:12,471 INFO [train.py:996] (2/4) Epoch 12, batch 25200, loss[loss=0.1989, simple_loss=0.2722, pruned_loss=0.06277, over 21261.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2849, pruned_loss=0.06438, over 4242934.89 frames. ], batch size: 159, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 22:50:05,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2163966.0, ans=0.125 2023-06-28 22:50:21,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2164026.0, ans=0.0 2023-06-28 22:50:38,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2164086.0, ans=0.0 2023-06-28 22:50:54,693 INFO [train.py:996] (2/4) Epoch 12, batch 25250, loss[loss=0.208, simple_loss=0.2818, pruned_loss=0.06715, over 21610.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2824, pruned_loss=0.06263, over 4253401.43 frames. ], batch size: 332, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 22:51:04,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2164146.0, ans=0.125 2023-06-28 22:51:07,213 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.15 vs. 
limit=15.0 2023-06-28 22:51:33,366 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.757e+02 8.180e+02 1.142e+03 1.720e+03 2.915e+03, threshold=2.285e+03, percent-clipped=21.0 2023-06-28 22:52:20,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2164386.0, ans=0.125 2023-06-28 22:52:23,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2164386.0, ans=0.0 2023-06-28 22:52:27,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2164386.0, ans=0.1 2023-06-28 22:52:36,304 INFO [train.py:996] (2/4) Epoch 12, batch 25300, loss[loss=0.2348, simple_loss=0.3125, pruned_loss=0.07858, over 21318.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2807, pruned_loss=0.06294, over 4245074.19 frames. ], batch size: 548, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 22:52:44,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2164446.0, ans=0.125 2023-06-28 22:52:51,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2164446.0, ans=10.0 2023-06-28 22:53:09,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2164506.0, ans=0.0 2023-06-28 22:53:38,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2164566.0, ans=0.125 2023-06-28 22:53:38,976 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.62 vs. limit=15.0 2023-06-28 22:54:21,643 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0 2023-06-28 22:54:22,193 INFO [train.py:996] (2/4) Epoch 12, batch 25350, loss[loss=0.1664, simple_loss=0.2603, pruned_loss=0.03629, over 21777.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2828, pruned_loss=0.06247, over 4248442.40 frames. ], batch size: 282, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 22:54:22,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2164746.0, ans=0.1 2023-06-28 22:54:30,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2164746.0, ans=0.125 2023-06-28 22:54:30,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2164746.0, ans=0.0 2023-06-28 22:54:55,964 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.352e+02 8.129e+02 1.301e+03 1.964e+03 4.138e+03, threshold=2.601e+03, percent-clipped=21.0 2023-06-28 22:55:20,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2164866.0, ans=0.125 2023-06-28 22:55:29,558 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.04 vs. limit=15.0 2023-06-28 22:55:57,310 INFO [train.py:996] (2/4) Epoch 12, batch 25400, loss[loss=0.1828, simple_loss=0.2527, pruned_loss=0.05646, over 21878.00 frames. 
], tot_loss[loss=0.2002, simple_loss=0.2779, pruned_loss=0.06126, over 4248558.62 frames. ], batch size: 373, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 22:56:13,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2165046.0, ans=0.125 2023-06-28 22:57:18,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2165226.0, ans=0.0 2023-06-28 22:57:37,576 INFO [train.py:996] (2/4) Epoch 12, batch 25450, loss[loss=0.2621, simple_loss=0.3253, pruned_loss=0.0995, over 21610.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.279, pruned_loss=0.06264, over 4260885.74 frames. ], batch size: 471, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 22:57:53,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2165346.0, ans=0.0 2023-06-28 22:57:53,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2165346.0, ans=0.2 2023-06-28 22:58:04,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2165406.0, ans=0.125 2023-06-28 22:58:11,895 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.754e+02 9.465e+02 1.393e+03 2.029e+03 3.933e+03, threshold=2.786e+03, percent-clipped=12.0 2023-06-28 22:59:25,650 INFO [train.py:996] (2/4) Epoch 12, batch 25500, loss[loss=0.2176, simple_loss=0.3119, pruned_loss=0.06165, over 21884.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2794, pruned_loss=0.0601, over 4252779.85 frames. ], batch size: 372, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 22:59:58,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2165706.0, ans=0.125 2023-06-28 23:00:03,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2165706.0, ans=0.0 2023-06-28 23:00:20,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2165766.0, ans=0.125 2023-06-28 23:01:09,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2165886.0, ans=0.1 2023-06-28 23:01:12,021 INFO [train.py:996] (2/4) Epoch 12, batch 25550, loss[loss=0.2314, simple_loss=0.3292, pruned_loss=0.06679, over 21669.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2859, pruned_loss=0.05995, over 4254156.40 frames. ], batch size: 441, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:01:19,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2165946.0, ans=0.125 2023-06-28 23:01:21,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2165946.0, ans=0.125 2023-06-28 23:01:21,847 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.62 vs. 
limit=6.0 2023-06-28 23:01:35,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2166006.0, ans=0.125 2023-06-28 23:01:36,767 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=15.73 vs. limit=15.0 2023-06-28 23:01:46,810 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.170e+02 8.424e+02 1.256e+03 1.965e+03 3.448e+03, threshold=2.512e+03, percent-clipped=4.0 2023-06-28 23:02:27,636 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.50 vs. limit=15.0 2023-06-28 23:02:33,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2166186.0, ans=0.125 2023-06-28 23:02:58,675 INFO [train.py:996] (2/4) Epoch 12, batch 25600, loss[loss=0.2424, simple_loss=0.3261, pruned_loss=0.07935, over 21440.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2895, pruned_loss=0.0609, over 4262395.38 frames. ], batch size: 131, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 23:03:22,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2166306.0, ans=0.125 2023-06-28 23:04:16,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2166426.0, ans=0.125 2023-06-28 23:04:39,577 INFO [train.py:996] (2/4) Epoch 12, batch 25650, loss[loss=0.1916, simple_loss=0.2533, pruned_loss=0.06491, over 21655.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.29, pruned_loss=0.06359, over 4261000.35 frames. ], batch size: 247, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 23:04:58,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2166546.0, ans=0.1 2023-06-28 23:05:02,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2166606.0, ans=0.125 2023-06-28 23:05:10,124 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.401e+02 8.404e+02 1.162e+03 1.787e+03 4.210e+03, threshold=2.325e+03, percent-clipped=7.0 2023-06-28 23:05:17,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2166666.0, ans=0.125 2023-06-28 23:05:43,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2166726.0, ans=0.07 2023-06-28 23:06:19,767 INFO [train.py:996] (2/4) Epoch 12, batch 25700, loss[loss=0.2256, simple_loss=0.3082, pruned_loss=0.07152, over 21650.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2883, pruned_loss=0.06455, over 4251598.43 frames. ], batch size: 389, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:06:58,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2166966.0, ans=0.125 2023-06-28 23:07:58,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2167086.0, ans=0.125 2023-06-28 23:08:07,858 INFO [train.py:996] (2/4) Epoch 12, batch 25750, loss[loss=0.226, simple_loss=0.3083, pruned_loss=0.07182, over 21599.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2928, pruned_loss=0.06793, over 4256880.02 frames. 
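Note on the grad_scale values attached to the train.py loss lines (8.0, 16.0 and 32.0 in this stretch): they move the way a dynamic fp16 loss scale does, halved when a scaled backward pass produces inf/nan gradients and grown back after a run of clean steps. A generic mixed-precision step with torch.cuda.amp is sketched below as an illustration; model, optimizer and batch are placeholders and the exact setup in train.py may differ.

import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler(init_scale=32.0)

def train_step(model, optimizer, batch):
    optimizer.zero_grad()
    with autocast():
        loss = model(batch)           # assumed to return a scalar loss
    scaler.scale(loss).backward()     # backprop on the scaled loss
    scaler.step(optimizer)            # skips the update if inf/nan gradients were found
    scaler.update()                   # adjusts the scale, e.g. halves it after an overflow
    return loss.detach(), scaler.get_scale()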
], batch size: 263, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:08:18,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2167146.0, ans=0.1 2023-06-28 23:08:19,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2167146.0, ans=0.125 2023-06-28 23:08:24,566 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.75 vs. limit=6.0 2023-06-28 23:08:25,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2167206.0, ans=0.125 2023-06-28 23:08:39,821 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.094e+02 7.467e+02 1.127e+03 1.693e+03 5.779e+03, threshold=2.254e+03, percent-clipped=13.0 2023-06-28 23:08:43,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2167206.0, ans=0.125 2023-06-28 23:08:57,613 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. limit=6.0 2023-06-28 23:09:42,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2167386.0, ans=0.0 2023-06-28 23:09:43,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2167386.0, ans=0.0 2023-06-28 23:09:51,799 INFO [train.py:996] (2/4) Epoch 12, batch 25800, loss[loss=0.2398, simple_loss=0.3337, pruned_loss=0.07297, over 21850.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3037, pruned_loss=0.07193, over 4258205.27 frames. ], batch size: 124, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:09:52,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2167446.0, ans=0.2 2023-06-28 23:10:25,727 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.04 vs. limit=5.0 2023-06-28 23:10:43,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2167566.0, ans=0.0 2023-06-28 23:11:19,913 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.78 vs. limit=15.0 2023-06-28 23:11:33,287 INFO [train.py:996] (2/4) Epoch 12, batch 25850, loss[loss=0.2086, simple_loss=0.2832, pruned_loss=0.06699, over 21496.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3054, pruned_loss=0.07156, over 4265994.54 frames. ], batch size: 211, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:12:08,744 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.172e+02 7.676e+02 1.093e+03 1.751e+03 3.507e+03, threshold=2.187e+03, percent-clipped=11.0 2023-06-28 23:12:24,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2167866.0, ans=0.0 2023-06-28 23:12:56,277 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.91 vs. 
limit=15.0 2023-06-28 23:13:03,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2167986.0, ans=0.125 2023-06-28 23:13:23,853 INFO [train.py:996] (2/4) Epoch 12, batch 25900, loss[loss=0.2646, simple_loss=0.3559, pruned_loss=0.08662, over 21760.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3065, pruned_loss=0.07173, over 4267362.54 frames. ], batch size: 414, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:14:45,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2168286.0, ans=0.1 2023-06-28 23:15:07,745 INFO [train.py:996] (2/4) Epoch 12, batch 25950, loss[loss=0.2481, simple_loss=0.3602, pruned_loss=0.06801, over 20847.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3125, pruned_loss=0.07419, over 4269049.31 frames. ], batch size: 607, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:15:43,675 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.181e+02 7.724e+02 1.093e+03 1.792e+03 4.212e+03, threshold=2.186e+03, percent-clipped=19.0 2023-06-28 23:15:50,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2168466.0, ans=0.0 2023-06-28 23:16:54,130 INFO [train.py:996] (2/4) Epoch 12, batch 26000, loss[loss=0.2873, simple_loss=0.3641, pruned_loss=0.1052, over 21396.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3135, pruned_loss=0.0734, over 4266491.66 frames. ], batch size: 507, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 23:17:54,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2168826.0, ans=0.0 2023-06-28 23:17:59,437 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.33 vs. limit=12.0 2023-06-28 23:18:36,004 INFO [train.py:996] (2/4) Epoch 12, batch 26050, loss[loss=0.214, simple_loss=0.2857, pruned_loss=0.07118, over 21236.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.313, pruned_loss=0.0742, over 4272404.96 frames. ], batch size: 143, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:19:08,204 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.201e+02 7.488e+02 9.436e+02 1.230e+03 3.511e+03, threshold=1.887e+03, percent-clipped=1.0 2023-06-28 23:19:08,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2169006.0, ans=0.015 2023-06-28 23:19:54,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2169186.0, ans=0.125 2023-06-28 23:20:16,748 INFO [train.py:996] (2/4) Epoch 12, batch 26100, loss[loss=0.1924, simple_loss=0.2619, pruned_loss=0.06143, over 21828.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3067, pruned_loss=0.07341, over 4282424.23 frames. 
], batch size: 247, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:20:56,718 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 23:21:39,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2169486.0, ans=0.0 2023-06-28 23:21:50,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2169486.0, ans=0.125 2023-06-28 23:22:03,448 INFO [train.py:996] (2/4) Epoch 12, batch 26150, loss[loss=0.243, simple_loss=0.319, pruned_loss=0.08349, over 21270.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3032, pruned_loss=0.07325, over 4285259.32 frames. ], batch size: 159, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:22:08,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2169546.0, ans=0.0 2023-06-28 23:22:08,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2169546.0, ans=0.125 2023-06-28 23:22:31,480 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.410e+02 8.718e+02 1.214e+03 1.632e+03 3.208e+03, threshold=2.428e+03, percent-clipped=15.0 2023-06-28 23:22:47,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2169666.0, ans=0.125 2023-06-28 23:23:43,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2169846.0, ans=0.0 2023-06-28 23:23:44,680 INFO [train.py:996] (2/4) Epoch 12, batch 26200, loss[loss=0.2197, simple_loss=0.3267, pruned_loss=0.05632, over 21750.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3032, pruned_loss=0.07128, over 4284992.90 frames. ], batch size: 298, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:24:06,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2169906.0, ans=0.1 2023-06-28 23:24:11,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2169906.0, ans=0.2 2023-06-28 23:24:16,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=2169966.0, ans=0.02 2023-06-28 23:24:24,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2169966.0, ans=0.1 2023-06-28 23:24:51,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2170026.0, ans=0.2 2023-06-28 23:25:08,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2170086.0, ans=0.2 2023-06-28 23:25:10,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2170086.0, ans=0.125 2023-06-28 23:25:25,901 INFO [train.py:996] (2/4) Epoch 12, batch 26250, loss[loss=0.2024, simple_loss=0.2845, pruned_loss=0.06014, over 21378.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3056, pruned_loss=0.07018, over 4284308.09 frames. 
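Note on the two loss groups in each train.py record: loss[...] is measured on the current batch alone ("over 21378.00 frames" and similar), while tot_loss[...] is aggregated over a much larger recent window ("over 4284308.09 frames" and similar), which is why it moves far more smoothly. A frames-weighted running aggregate of that kind is sketched below; the decay constant and reset behaviour are assumptions, not the exact bookkeeping in train.py.

class FramesWeightedAverage:
    """Illustrative frames-weighted aggregate like the tot_loss[...] entries."""

    def __init__(self, decay: float = 0.995):
        self.decay = decay
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss: float, batch_frames: float):
        # Each batch contributes its loss weighted by its frame count;
        # older batches fade out through the (assumed) decay factor.
        self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)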
], batch size: 159, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:25:52,909 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.372e+02 9.468e+02 1.366e+03 2.102e+03 4.403e+03, threshold=2.732e+03, percent-clipped=13.0 2023-06-28 23:26:24,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2170326.0, ans=0.125 2023-06-28 23:27:01,315 INFO [train.py:996] (2/4) Epoch 12, batch 26300, loss[loss=0.2037, simple_loss=0.2793, pruned_loss=0.06399, over 21804.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.303, pruned_loss=0.07097, over 4294752.20 frames. ], batch size: 247, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:27:19,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2170506.0, ans=0.1 2023-06-28 23:27:21,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2170506.0, ans=0.2 2023-06-28 23:27:43,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2170566.0, ans=0.2 2023-06-28 23:28:38,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2170686.0, ans=0.0 2023-06-28 23:28:42,460 INFO [train.py:996] (2/4) Epoch 12, batch 26350, loss[loss=0.2631, simple_loss=0.3325, pruned_loss=0.09679, over 21592.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3021, pruned_loss=0.07179, over 4297268.75 frames. ], batch size: 415, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:29:19,236 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.451e+02 7.926e+02 1.139e+03 2.111e+03 4.700e+03, threshold=2.277e+03, percent-clipped=11.0 2023-06-28 23:29:20,526 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.69 vs. limit=15.0 2023-06-28 23:29:48,253 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.16 vs. limit=15.0 2023-06-28 23:29:52,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2170926.0, ans=0.1 2023-06-28 23:30:13,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2170986.0, ans=0.125 2023-06-28 23:30:23,052 INFO [train.py:996] (2/4) Epoch 12, batch 26400, loss[loss=0.1745, simple_loss=0.2421, pruned_loss=0.05342, over 21630.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2958, pruned_loss=0.07118, over 4294833.55 frames. ], batch size: 298, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 23:30:37,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2171046.0, ans=0.1 2023-06-28 23:31:28,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2171226.0, ans=0.95 2023-06-28 23:32:16,582 INFO [train.py:996] (2/4) Epoch 12, batch 26450, loss[loss=0.2572, simple_loss=0.3589, pruned_loss=0.07779, over 21837.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2953, pruned_loss=0.07074, over 4284574.54 frames. 
], batch size: 372, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:32:23,086 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=12.0 2023-06-28 23:32:32,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2171406.0, ans=0.125 2023-06-28 23:32:51,627 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.901e+02 9.721e+02 1.441e+03 2.127e+03 5.226e+03, threshold=2.882e+03, percent-clipped=23.0 2023-06-28 23:32:55,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2171466.0, ans=0.0 2023-06-28 23:32:59,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2171466.0, ans=0.0 2023-06-28 23:33:18,383 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 23:33:59,930 INFO [train.py:996] (2/4) Epoch 12, batch 26500, loss[loss=0.2193, simple_loss=0.3358, pruned_loss=0.05136, over 20817.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2976, pruned_loss=0.06914, over 4277322.64 frames. ], batch size: 607, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:34:00,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2171646.0, ans=0.125 2023-06-28 23:34:11,379 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.49 vs. limit=15.0 2023-06-28 23:35:47,899 INFO [train.py:996] (2/4) Epoch 12, batch 26550, loss[loss=0.1859, simple_loss=0.272, pruned_loss=0.04988, over 21584.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2958, pruned_loss=0.06697, over 4265044.29 frames. ], batch size: 230, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:35:52,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2171946.0, ans=0.0 2023-06-28 23:36:23,125 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.969e+02 7.971e+02 1.184e+03 2.245e+03 4.419e+03, threshold=2.369e+03, percent-clipped=15.0 2023-06-28 23:37:08,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2172186.0, ans=0.125 2023-06-28 23:37:28,575 INFO [train.py:996] (2/4) Epoch 12, batch 26600, loss[loss=0.1871, simple_loss=0.266, pruned_loss=0.05404, over 21623.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2936, pruned_loss=0.06378, over 4264073.53 frames. ], batch size: 247, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:37:40,324 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 23:38:09,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2172366.0, ans=0.2 2023-06-28 23:38:39,346 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.98 vs. limit=15.0 2023-06-28 23:38:40,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2172426.0, ans=0.125 2023-06-28 23:39:08,226 INFO [train.py:996] (2/4) Epoch 12, batch 26650, loss[loss=0.1471, simple_loss=0.2292, pruned_loss=0.03247, over 21518.00 frames. 
], tot_loss[loss=0.206, simple_loss=0.2868, pruned_loss=0.06266, over 4265803.80 frames. ], batch size: 195, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:39:08,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2172546.0, ans=0.125 2023-06-28 23:39:24,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2172546.0, ans=0.05 2023-06-28 23:39:32,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2172606.0, ans=0.125 2023-06-28 23:39:46,395 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.698e+02 6.768e+02 8.885e+02 1.234e+03 3.430e+03, threshold=1.777e+03, percent-clipped=1.0 2023-06-28 23:40:04,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2172666.0, ans=0.125 2023-06-28 23:40:12,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2172726.0, ans=0.0 2023-06-28 23:40:39,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2172786.0, ans=0.125 2023-06-28 23:40:42,638 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.28 vs. limit=8.0 2023-06-28 23:40:52,089 INFO [train.py:996] (2/4) Epoch 12, batch 26700, loss[loss=0.2074, simple_loss=0.2717, pruned_loss=0.07153, over 21596.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2813, pruned_loss=0.06083, over 4266559.06 frames. ], batch size: 548, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:40:57,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2172846.0, ans=0.07 2023-06-28 23:41:08,782 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.12 vs. limit=15.0 2023-06-28 23:41:19,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2172906.0, ans=0.0 2023-06-28 23:41:49,253 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.95 vs. limit=15.0 2023-06-28 23:42:00,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2173026.0, ans=0.125 2023-06-28 23:42:01,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2173026.0, ans=0.125 2023-06-28 23:42:33,656 INFO [train.py:996] (2/4) Epoch 12, batch 26750, loss[loss=0.2157, simple_loss=0.2966, pruned_loss=0.06741, over 21754.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.28, pruned_loss=0.05971, over 4271396.65 frames. 
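Note on the loss fields in the entries above: the logged values are consistent with the reported loss being a weighted sum of the two component losses, loss ≈ 0.5 * simple_loss + pruned_loss. The 0.5 weight is an assumption inferred from the numbers themselves, not read out of the training code; a quick check against the batch 26750 entry (Python):

    # Assumed relation: loss ≈ w * simple_loss + pruned_loss with w = 0.5
    # (the weight is inferred from the logged numbers, not from train.py).
    w = 0.5
    simple_loss, pruned_loss = 0.2966, 0.06741   # per-batch fields at batch 26750
    print(round(w * simple_loss + pruned_loss, 4))  # -> 0.2157, the logged loss

The running tot_loss fields satisfy the same relation (0.5 * 0.28 + 0.05971 ≈ 0.1997).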
], batch size: 332, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:42:47,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2173146.0, ans=0.125 2023-06-28 23:43:12,967 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.416e+02 8.010e+02 1.094e+03 1.630e+03 3.819e+03, threshold=2.187e+03, percent-clipped=19.0 2023-06-28 23:43:19,513 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.88 vs. limit=22.5 2023-06-28 23:44:20,463 INFO [train.py:996] (2/4) Epoch 12, batch 26800, loss[loss=0.2496, simple_loss=0.3224, pruned_loss=0.08836, over 21819.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2872, pruned_loss=0.06308, over 4272005.09 frames. ], batch size: 441, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 23:44:43,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2173506.0, ans=0.0 2023-06-28 23:44:49,445 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.07 vs. limit=15.0 2023-06-28 23:46:05,360 INFO [train.py:996] (2/4) Epoch 12, batch 26850, loss[loss=0.2005, simple_loss=0.2715, pruned_loss=0.06471, over 21856.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2879, pruned_loss=0.06524, over 4273812.37 frames. ], batch size: 118, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:46:12,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2173746.0, ans=0.125 2023-06-28 23:46:17,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2173746.0, ans=0.07 2023-06-28 23:46:22,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2173806.0, ans=0.125 2023-06-28 23:46:40,819 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.776e+02 8.023e+02 1.160e+03 1.579e+03 4.505e+03, threshold=2.321e+03, percent-clipped=8.0 2023-06-28 23:47:08,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2173926.0, ans=0.09899494936611666 2023-06-28 23:47:14,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2173926.0, ans=0.0 2023-06-28 23:47:32,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2173986.0, ans=0.0 2023-06-28 23:47:40,063 INFO [train.py:996] (2/4) Epoch 12, batch 26900, loss[loss=0.186, simple_loss=0.253, pruned_loss=0.05955, over 21540.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2806, pruned_loss=0.06435, over 4271673.29 frames. ], batch size: 391, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:48:07,971 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.91 vs. limit=15.0 2023-06-28 23:48:28,869 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.49 vs. 
limit=15.0 2023-06-28 23:48:49,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2174226.0, ans=0.0 2023-06-28 23:49:11,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2174286.0, ans=0.125 2023-06-28 23:49:11,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2174286.0, ans=0.125 2023-06-28 23:49:18,474 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=22.5 2023-06-28 23:49:19,001 INFO [train.py:996] (2/4) Epoch 12, batch 26950, loss[loss=0.2259, simple_loss=0.3192, pruned_loss=0.06633, over 21825.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2799, pruned_loss=0.06418, over 4268674.59 frames. ], batch size: 371, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:49:40,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2174406.0, ans=0.125 2023-06-28 23:49:44,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2174406.0, ans=0.0 2023-06-28 23:49:54,865 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.876e+02 6.986e+02 1.003e+03 1.529e+03 4.492e+03, threshold=2.006e+03, percent-clipped=11.0 2023-06-28 23:50:01,816 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 23:50:08,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2174466.0, ans=0.2 2023-06-28 23:50:26,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2174526.0, ans=0.125 2023-06-28 23:50:35,921 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=15.0 2023-06-28 23:50:37,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2174526.0, ans=0.0 2023-06-28 23:51:06,162 INFO [train.py:996] (2/4) Epoch 12, batch 27000, loss[loss=0.1737, simple_loss=0.2713, pruned_loss=0.03802, over 21759.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2797, pruned_loss=0.06198, over 4275833.97 frames. ], batch size: 282, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:51:06,162 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-28 23:51:22,026 INFO [train.py:1028] (2/4) Epoch 12, validation: loss=0.2512, simple_loss=0.3387, pruned_loss=0.08188, over 1796401.00 frames. 2023-06-28 23:51:22,026 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23793MB 2023-06-28 23:51:46,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2174706.0, ans=0.1 2023-06-28 23:52:48,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2174886.0, ans=0.125 2023-06-28 23:53:01,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2174886.0, ans=0.125 2023-06-28 23:53:03,874 INFO [train.py:996] (2/4) Epoch 12, batch 27050, loss[loss=0.2563, simple_loss=0.3331, pruned_loss=0.08978, over 21596.00 frames. 
], tot_loss[loss=0.2012, simple_loss=0.283, pruned_loss=0.05964, over 4281892.82 frames. ], batch size: 507, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:53:39,411 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.68 vs. limit=22.5 2023-06-28 23:53:40,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2175006.0, ans=0.125 2023-06-28 23:53:44,983 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.952e+02 1.010e+03 1.463e+03 2.409e+03 4.686e+03, threshold=2.925e+03, percent-clipped=39.0 2023-06-28 23:53:50,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2175066.0, ans=0.1 2023-06-28 23:54:38,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2175186.0, ans=0.1 2023-06-28 23:54:39,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2175186.0, ans=0.125 2023-06-28 23:54:45,816 INFO [train.py:996] (2/4) Epoch 12, batch 27100, loss[loss=0.1931, simple_loss=0.2869, pruned_loss=0.04969, over 21899.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2848, pruned_loss=0.06017, over 4290155.93 frames. ], batch size: 316, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:55:01,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2175246.0, ans=0.2 2023-06-28 23:55:19,405 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=22.5 2023-06-28 23:55:39,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2175366.0, ans=0.1 2023-06-28 23:56:34,065 INFO [train.py:996] (2/4) Epoch 12, batch 27150, loss[loss=0.2492, simple_loss=0.3402, pruned_loss=0.07909, over 21820.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2956, pruned_loss=0.06371, over 4285223.51 frames. 
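The "Maximum memory allocated so far is ...MB" entries report the peak CUDA memory seen on this rank since training started. A minimal sketch of how such a figure can be obtained with the standard PyTorch API (assumed for illustration; the exact rounding used by the script may differ):

    import torch

    def peak_memory_mb(device: torch.device) -> int:
        # Peak bytes allocated on `device` since startup (or the last reset).
        peak_bytes = torch.cuda.max_memory_allocated(device)
        return peak_bytes // (1024 * 1024)

    # e.g. printed after computing the validation loss:
    # print(f"Maximum memory allocated so far is {peak_memory_mb(device)}MB")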
], batch size: 282, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:56:41,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2175546.0, ans=0.125 2023-06-28 23:56:45,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2175546.0, ans=0.1 2023-06-28 23:57:01,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2175606.0, ans=0.125 2023-06-28 23:57:01,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2175606.0, ans=0.0 2023-06-28 23:57:14,486 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.191e+02 8.496e+02 1.171e+03 1.771e+03 3.313e+03, threshold=2.341e+03, percent-clipped=5.0 2023-06-28 23:57:29,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2175666.0, ans=0.0 2023-06-28 23:57:42,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2175726.0, ans=0.1 2023-06-28 23:57:52,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2175786.0, ans=0.125 2023-06-28 23:58:02,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2175786.0, ans=0.125 2023-06-28 23:58:15,371 INFO [train.py:996] (2/4) Epoch 12, batch 27200, loss[loss=0.314, simple_loss=0.3742, pruned_loss=0.1269, over 21406.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.3033, pruned_loss=0.06627, over 4287025.14 frames. ], batch size: 508, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 23:58:41,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2175906.0, ans=0.125 2023-06-28 23:58:41,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2175906.0, ans=0.1 2023-06-28 23:59:32,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2176026.0, ans=0.125 2023-06-28 23:59:39,930 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.00 vs. limit=15.0 2023-06-28 23:59:55,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2176086.0, ans=0.2 2023-06-29 00:00:01,698 INFO [train.py:996] (2/4) Epoch 12, batch 27250, loss[loss=0.2154, simple_loss=0.2872, pruned_loss=0.07177, over 21652.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3061, pruned_loss=0.07039, over 4286059.67 frames. ], batch size: 263, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:00:45,294 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.531e+02 9.436e+02 1.424e+03 2.260e+03 4.305e+03, threshold=2.849e+03, percent-clipped=22.0 2023-06-29 00:01:06,539 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.97 vs. 
limit=15.0 2023-06-29 00:01:27,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2176386.0, ans=0.2 2023-06-29 00:01:49,813 INFO [train.py:996] (2/4) Epoch 12, batch 27300, loss[loss=0.2148, simple_loss=0.3022, pruned_loss=0.06373, over 21781.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3083, pruned_loss=0.0717, over 4281911.46 frames. ], batch size: 247, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:02:50,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2176566.0, ans=0.0 2023-06-29 00:03:11,509 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.72 vs. limit=15.0 2023-06-29 00:03:19,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2176686.0, ans=0.125 2023-06-29 00:03:31,715 INFO [train.py:996] (2/4) Epoch 12, batch 27350, loss[loss=0.2376, simple_loss=0.3193, pruned_loss=0.07791, over 21743.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3102, pruned_loss=0.07206, over 4284112.33 frames. ], batch size: 441, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:03:44,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2176746.0, ans=0.0 2023-06-29 00:03:49,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2176746.0, ans=0.0 2023-06-29 00:04:02,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2176806.0, ans=0.0 2023-06-29 00:04:13,502 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.353e+02 7.469e+02 1.032e+03 1.512e+03 4.171e+03, threshold=2.065e+03, percent-clipped=4.0 2023-06-29 00:04:34,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2176926.0, ans=0.0 2023-06-29 00:04:44,344 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.51 vs. limit=15.0 2023-06-29 00:04:59,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2176986.0, ans=0.125 2023-06-29 00:05:01,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2176986.0, ans=0.125 2023-06-29 00:05:15,162 INFO [train.py:996] (2/4) Epoch 12, batch 27400, loss[loss=0.1977, simple_loss=0.2659, pruned_loss=0.0647, over 21715.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3052, pruned_loss=0.07141, over 4286851.50 frames. ], batch size: 414, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:05:54,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2177106.0, ans=0.1 2023-06-29 00:06:43,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2177286.0, ans=0.125 2023-06-29 00:06:55,555 INFO [train.py:996] (2/4) Epoch 12, batch 27450, loss[loss=0.2247, simple_loss=0.3048, pruned_loss=0.07227, over 21306.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2999, pruned_loss=0.06961, over 4279801.49 frames. 
], batch size: 548, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:07:02,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2177346.0, ans=0.1 2023-06-29 00:07:30,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2177406.0, ans=0.125 2023-06-29 00:07:32,479 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.358e+02 7.847e+02 1.147e+03 1.584e+03 3.380e+03, threshold=2.294e+03, percent-clipped=11.0 2023-06-29 00:08:31,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2177586.0, ans=0.1 2023-06-29 00:08:34,578 INFO [train.py:996] (2/4) Epoch 12, batch 27500, loss[loss=0.1938, simple_loss=0.2707, pruned_loss=0.05841, over 21869.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2989, pruned_loss=0.07039, over 4291626.23 frames. ], batch size: 298, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:09:39,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2177826.0, ans=0.125 2023-06-29 00:09:44,556 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.40 vs. limit=12.0 2023-06-29 00:09:45,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2177826.0, ans=0.125 2023-06-29 00:10:15,303 INFO [train.py:996] (2/4) Epoch 12, batch 27550, loss[loss=0.1845, simple_loss=0.2622, pruned_loss=0.05336, over 21746.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2938, pruned_loss=0.06733, over 4284057.45 frames. ], batch size: 351, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:10:17,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2177946.0, ans=0.1 2023-06-29 00:10:17,964 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.61 vs. limit=6.0 2023-06-29 00:10:57,005 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.117e+02 1.004e+03 1.516e+03 2.430e+03 4.785e+03, threshold=3.032e+03, percent-clipped=27.0 2023-06-29 00:11:08,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2178066.0, ans=0.1 2023-06-29 00:11:19,699 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-29 00:11:29,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2178126.0, ans=0.0 2023-06-29 00:11:54,697 INFO [train.py:996] (2/4) Epoch 12, batch 27600, loss[loss=0.1847, simple_loss=0.2524, pruned_loss=0.05848, over 21331.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2861, pruned_loss=0.06602, over 4286984.16 frames. ], batch size: 211, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:11:57,426 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-29 00:12:04,054 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.25 vs. 
limit=8.0 2023-06-29 00:12:41,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2178366.0, ans=0.0 2023-06-29 00:13:04,966 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=15.0 2023-06-29 00:13:11,133 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.76 vs. limit=10.0 2023-06-29 00:13:13,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2178486.0, ans=0.125 2023-06-29 00:13:15,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2178486.0, ans=0.0 2023-06-29 00:13:28,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2178486.0, ans=0.0 2023-06-29 00:13:34,125 INFO [train.py:996] (2/4) Epoch 12, batch 27650, loss[loss=0.1905, simple_loss=0.2705, pruned_loss=0.05522, over 16751.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2808, pruned_loss=0.06512, over 4274063.80 frames. ], batch size: 64, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:13:51,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2178606.0, ans=0.2 2023-06-29 00:13:58,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2178606.0, ans=0.125 2023-06-29 00:14:08,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2178606.0, ans=0.125 2023-06-29 00:14:09,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2178606.0, ans=0.125 2023-06-29 00:14:16,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2178666.0, ans=0.0 2023-06-29 00:14:17,724 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.954e+02 7.725e+02 1.101e+03 1.627e+03 3.974e+03, threshold=2.201e+03, percent-clipped=3.0 2023-06-29 00:14:20,839 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.96 vs. limit=15.0 2023-06-29 00:14:38,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2178726.0, ans=0.1 2023-06-29 00:15:03,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2178786.0, ans=0.1 2023-06-29 00:15:15,562 INFO [train.py:996] (2/4) Epoch 12, batch 27700, loss[loss=0.1714, simple_loss=0.2483, pruned_loss=0.04722, over 21891.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2822, pruned_loss=0.06406, over 4271229.99 frames. 
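Each "Clipping_scale=2.0, grad-norm quartiles ... threshold=..., percent-clipped=..." entry summarizes recent gradient norms (their 0/25/50/75/100% quantiles), the threshold in force, and the fraction of recent steps whose gradients were scaled down. A minimal sketch of this style of norm-threshold clipping, assuming the threshold is supplied from those statistics (how the real optimizer derives it is not shown in the log):

    import torch

    def clip_grad_by_threshold(parameters, threshold: float) -> bool:
        # Scale all gradients down when their global L2 norm exceeds `threshold`.
        # The return value feeds the percent-clipped statistic.
        grads = [p.grad for p in parameters if p.grad is not None]
        total_norm = torch.norm(torch.stack([g.norm() for g in grads]))
        if total_norm > threshold:
            for g in grads:
                g.mul_(threshold / (total_norm + 1e-6))
            return True
        return False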
], batch size: 98, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:15:17,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2178846.0, ans=0.125 2023-06-29 00:15:34,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2178846.0, ans=0.1 2023-06-29 00:15:39,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2178906.0, ans=0.125 2023-06-29 00:15:47,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2178906.0, ans=0.125 2023-06-29 00:16:17,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2179026.0, ans=0.125 2023-06-29 00:16:56,216 INFO [train.py:996] (2/4) Epoch 12, batch 27750, loss[loss=0.2027, simple_loss=0.2794, pruned_loss=0.06301, over 21865.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2863, pruned_loss=0.06442, over 4276703.96 frames. ], batch size: 107, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:17:19,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2179206.0, ans=0.125 2023-06-29 00:17:24,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2179206.0, ans=0.0 2023-06-29 00:17:24,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2179206.0, ans=0.0 2023-06-29 00:17:39,853 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.037e+02 8.775e+02 1.414e+03 2.124e+03 3.615e+03, threshold=2.828e+03, percent-clipped=21.0 2023-06-29 00:18:00,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2179326.0, ans=0.125 2023-06-29 00:18:35,464 INFO [train.py:996] (2/4) Epoch 12, batch 27800, loss[loss=0.2278, simple_loss=0.2951, pruned_loss=0.08025, over 21765.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2848, pruned_loss=0.06526, over 4285582.61 frames. ], batch size: 441, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:19:45,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2179626.0, ans=0.07 2023-06-29 00:20:16,294 INFO [train.py:996] (2/4) Epoch 12, batch 27850, loss[loss=0.2099, simple_loss=0.2831, pruned_loss=0.0683, over 21583.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2833, pruned_loss=0.06569, over 4294707.96 frames. ], batch size: 131, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:20:21,095 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.30 vs. 
limit=15.0 2023-06-29 00:20:34,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2179746.0, ans=0.1 2023-06-29 00:20:36,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2179806.0, ans=0.125 2023-06-29 00:20:56,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2179806.0, ans=0.2 2023-06-29 00:21:00,409 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.968e+02 8.980e+02 1.586e+03 2.122e+03 3.865e+03, threshold=3.171e+03, percent-clipped=6.0 2023-06-29 00:21:47,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2179986.0, ans=0.1 2023-06-29 00:21:58,234 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.55 vs. limit=10.0 2023-06-29 00:22:03,472 INFO [train.py:996] (2/4) Epoch 12, batch 27900, loss[loss=0.2584, simple_loss=0.359, pruned_loss=0.07896, over 21775.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2915, pruned_loss=0.06633, over 4287096.28 frames. ], batch size: 351, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:22:07,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2180046.0, ans=0.125 2023-06-29 00:23:51,647 INFO [train.py:996] (2/4) Epoch 12, batch 27950, loss[loss=0.2091, simple_loss=0.3141, pruned_loss=0.05204, over 21197.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2917, pruned_loss=0.06364, over 4284520.93 frames. ], batch size: 549, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:24:20,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2180406.0, ans=0.125 2023-06-29 00:24:22,541 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=12.0 2023-06-29 00:24:35,525 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.187e+02 9.154e+02 1.408e+03 1.897e+03 4.005e+03, threshold=2.816e+03, percent-clipped=4.0 2023-06-29 00:25:16,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2180586.0, ans=0.2 2023-06-29 00:25:31,892 INFO [train.py:996] (2/4) Epoch 12, batch 28000, loss[loss=0.2218, simple_loss=0.3014, pruned_loss=0.07108, over 21796.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2899, pruned_loss=0.06195, over 4275735.64 frames. ], batch size: 112, lr: 2.38e-03, grad_scale: 32.0 2023-06-29 00:25:35,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2180646.0, ans=0.0 2023-06-29 00:26:44,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2180826.0, ans=0.2 2023-06-29 00:26:52,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2180886.0, ans=0.125 2023-06-29 00:26:52,750 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.64 vs. 
limit=22.5 2023-06-29 00:26:57,853 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.14 vs. limit=15.0 2023-06-29 00:27:15,065 INFO [train.py:996] (2/4) Epoch 12, batch 28050, loss[loss=0.2216, simple_loss=0.3018, pruned_loss=0.07073, over 21840.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2884, pruned_loss=0.06334, over 4281028.31 frames. ], batch size: 371, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:27:51,923 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=22.5 2023-06-29 00:28:00,400 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.693e+02 7.704e+02 1.092e+03 1.721e+03 4.655e+03, threshold=2.185e+03, percent-clipped=4.0 2023-06-29 00:28:07,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2181066.0, ans=0.035 2023-06-29 00:28:13,247 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.93 vs. limit=10.0 2023-06-29 00:28:59,176 INFO [train.py:996] (2/4) Epoch 12, batch 28100, loss[loss=0.1991, simple_loss=0.2738, pruned_loss=0.06224, over 21755.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2859, pruned_loss=0.06289, over 4277112.50 frames. ], batch size: 107, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:29:43,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2181366.0, ans=0.1 2023-06-29 00:29:48,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2181366.0, ans=0.125 2023-06-29 00:30:14,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2181426.0, ans=0.0 2023-06-29 00:30:27,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2181486.0, ans=0.125 2023-06-29 00:30:39,425 INFO [train.py:996] (2/4) Epoch 12, batch 28150, loss[loss=0.1817, simple_loss=0.246, pruned_loss=0.05871, over 21582.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2803, pruned_loss=0.06255, over 4266116.99 frames. ], batch size: 298, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:31:20,817 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.946e+02 8.258e+02 1.413e+03 2.441e+03 4.810e+03, threshold=2.825e+03, percent-clipped=31.0 2023-06-29 00:32:20,023 INFO [train.py:996] (2/4) Epoch 12, batch 28200, loss[loss=0.2241, simple_loss=0.2917, pruned_loss=0.07822, over 21289.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2793, pruned_loss=0.06419, over 4265015.51 frames. 
], batch size: 176, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:32:23,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2181846.0, ans=0.125 2023-06-29 00:32:31,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2181846.0, ans=0.125 2023-06-29 00:32:49,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2181906.0, ans=0.125 2023-06-29 00:33:10,637 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.88 vs. limit=15.0 2023-06-29 00:33:56,374 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.64 vs. limit=10.0 2023-06-29 00:34:06,090 INFO [train.py:996] (2/4) Epoch 12, batch 28250, loss[loss=0.1953, simple_loss=0.2672, pruned_loss=0.06167, over 21672.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2826, pruned_loss=0.0662, over 4260949.27 frames. ], batch size: 298, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:34:10,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2182146.0, ans=0.125 2023-06-29 00:34:14,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2182146.0, ans=0.1 2023-06-29 00:34:18,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2182146.0, ans=0.125 2023-06-29 00:34:33,976 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.73 vs. limit=15.0 2023-06-29 00:34:48,083 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.469e+02 1.185e+03 1.669e+03 2.503e+03 4.651e+03, threshold=3.338e+03, percent-clipped=13.0 2023-06-29 00:35:48,351 INFO [train.py:996] (2/4) Epoch 12, batch 28300, loss[loss=0.1686, simple_loss=0.2676, pruned_loss=0.03483, over 21791.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.28, pruned_loss=0.06414, over 4258355.11 frames. ], batch size: 371, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:36:23,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2182506.0, ans=0.125 2023-06-29 00:36:40,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2182566.0, ans=0.125 2023-06-29 00:37:29,516 INFO [train.py:996] (2/4) Epoch 12, batch 28350, loss[loss=0.1857, simple_loss=0.2621, pruned_loss=0.05459, over 21627.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.278, pruned_loss=0.0599, over 4258382.96 frames. 
], batch size: 332, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:37:40,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2182746.0, ans=0.2 2023-06-29 00:38:02,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2182806.0, ans=0.125 2023-06-29 00:38:15,052 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.438e+02 7.077e+02 1.027e+03 1.914e+03 4.296e+03, threshold=2.054e+03, percent-clipped=2.0 2023-06-29 00:38:57,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2182986.0, ans=0.125 2023-06-29 00:39:10,247 INFO [train.py:996] (2/4) Epoch 12, batch 28400, loss[loss=0.2261, simple_loss=0.2996, pruned_loss=0.07634, over 21173.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2751, pruned_loss=0.05981, over 4252279.82 frames. ], batch size: 143, lr: 2.37e-03, grad_scale: 32.0 2023-06-29 00:39:18,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2183046.0, ans=0.0 2023-06-29 00:40:52,143 INFO [train.py:996] (2/4) Epoch 12, batch 28450, loss[loss=0.2096, simple_loss=0.2814, pruned_loss=0.06892, over 21832.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2803, pruned_loss=0.06344, over 4257459.04 frames. ], batch size: 282, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:41:18,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2183406.0, ans=0.04949747468305833 2023-06-29 00:41:18,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2183406.0, ans=0.025 2023-06-29 00:41:43,665 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.501e+02 7.826e+02 1.097e+03 1.608e+03 4.884e+03, threshold=2.195e+03, percent-clipped=11.0 2023-06-29 00:41:58,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2183526.0, ans=0.125 2023-06-29 00:42:24,427 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.16 vs. limit=15.0 2023-06-29 00:42:38,367 INFO [train.py:996] (2/4) Epoch 12, batch 28500, loss[loss=0.1843, simple_loss=0.2425, pruned_loss=0.06301, over 20127.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2813, pruned_loss=0.0647, over 4265053.99 frames. ], batch size: 703, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:42:47,874 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.38 vs. 
limit=15.0 2023-06-29 00:42:49,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2183646.0, ans=0.1 2023-06-29 00:42:50,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2183646.0, ans=0.125 2023-06-29 00:43:08,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2183706.0, ans=0.125 2023-06-29 00:43:25,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2183766.0, ans=0.0 2023-06-29 00:44:05,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2183886.0, ans=0.125 2023-06-29 00:44:05,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2183886.0, ans=0.0 2023-06-29 00:44:15,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2183886.0, ans=0.125 2023-06-29 00:44:15,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2183886.0, ans=0.125 2023-06-29 00:44:21,536 INFO [train.py:996] (2/4) Epoch 12, batch 28550, loss[loss=0.2213, simple_loss=0.3199, pruned_loss=0.06133, over 21289.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2909, pruned_loss=0.06776, over 4276127.50 frames. ], batch size: 176, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:44:32,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2183946.0, ans=0.125 2023-06-29 00:44:38,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2183946.0, ans=0.1 2023-06-29 00:44:43,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2184006.0, ans=0.0 2023-06-29 00:44:45,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2184006.0, ans=0.125 2023-06-29 00:45:07,018 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.11 vs. limit=10.0 2023-06-29 00:45:12,486 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.914e+02 9.584e+02 1.430e+03 2.076e+03 4.050e+03, threshold=2.859e+03, percent-clipped=23.0 2023-06-29 00:45:44,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2184186.0, ans=0.1 2023-06-29 00:46:00,127 INFO [train.py:996] (2/4) Epoch 12, batch 28600, loss[loss=0.2134, simple_loss=0.2916, pruned_loss=0.06755, over 20661.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2969, pruned_loss=0.06921, over 4276777.36 frames. 
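The ScheduledFloat entries record the current value ("ans") of float hyperparameters such as dropout probabilities and skip rates as a function of the global batch_count. A minimal sketch, assuming a piecewise-linear schedule over batch count (the schedule points below are illustrative, not the ones used in this run):

    from bisect import bisect_right

    class PiecewiseLinearSchedule:
        # A float that varies piecewise-linearly with the global batch count.
        def __init__(self, points):
            self.xs = [x for x, _ in points]   # batch counts, sorted ascending
            self.ys = [y for _, y in points]   # values at those batch counts

        def __call__(self, batch_count: float) -> float:
            if batch_count <= self.xs[0]:
                return self.ys[0]
            if batch_count >= self.xs[-1]:
                return self.ys[-1]
            i = bisect_right(self.xs, batch_count)
            x0, x1 = self.xs[i - 1], self.xs[i]
            y0, y1 = self.ys[i - 1], self.ys[i]
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

    # Illustrative: a dropout that decays from 0.3 to 0.1 over the first 20k batches,
    # then stays at 0.1 for the rest of training.
    dropout_p = PiecewiseLinearSchedule([(0, 0.3), (20000, 0.1)])
    print(dropout_p(2184006.0))   # -> 0.1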
], batch size: 607, lr: 2.37e-03, grad_scale: 8.0 2023-06-29 00:46:07,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2184246.0, ans=0.125 2023-06-29 00:46:10,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2184246.0, ans=0.125 2023-06-29 00:46:14,862 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.19 vs. limit=15.0 2023-06-29 00:46:44,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2184366.0, ans=0.2 2023-06-29 00:47:25,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2184486.0, ans=0.2 2023-06-29 00:47:45,763 INFO [train.py:996] (2/4) Epoch 12, batch 28650, loss[loss=0.2213, simple_loss=0.2847, pruned_loss=0.07901, over 20146.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2909, pruned_loss=0.06846, over 4276852.35 frames. ], batch size: 702, lr: 2.37e-03, grad_scale: 8.0 2023-06-29 00:47:59,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2184546.0, ans=0.125 2023-06-29 00:48:13,046 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-29 00:48:13,073 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.95 vs. limit=22.5 2023-06-29 00:48:30,054 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.107e+02 8.338e+02 1.219e+03 1.644e+03 3.488e+03, threshold=2.437e+03, percent-clipped=4.0 2023-06-29 00:49:26,536 INFO [train.py:996] (2/4) Epoch 12, batch 28700, loss[loss=0.2257, simple_loss=0.3068, pruned_loss=0.07228, over 21800.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2899, pruned_loss=0.06979, over 4273958.41 frames. ], batch size: 351, lr: 2.37e-03, grad_scale: 8.0 2023-06-29 00:50:20,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2185026.0, ans=0.025 2023-06-29 00:51:06,117 INFO [train.py:996] (2/4) Epoch 12, batch 28750, loss[loss=0.2167, simple_loss=0.2927, pruned_loss=0.07034, over 21860.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2902, pruned_loss=0.07011, over 4272570.61 frames. 
], batch size: 124, lr: 2.37e-03, grad_scale: 8.0 2023-06-29 00:51:16,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2185146.0, ans=0.09899494936611666 2023-06-29 00:51:16,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2185146.0, ans=0.125 2023-06-29 00:51:36,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2185206.0, ans=0.125 2023-06-29 00:51:50,492 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.316e+02 7.873e+02 1.096e+03 1.643e+03 3.604e+03, threshold=2.192e+03, percent-clipped=9.0 2023-06-29 00:52:35,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2185386.0, ans=0.0 2023-06-29 00:52:47,780 INFO [train.py:996] (2/4) Epoch 12, batch 28800, loss[loss=0.2369, simple_loss=0.3113, pruned_loss=0.08122, over 21890.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2935, pruned_loss=0.07001, over 4275845.20 frames. ], batch size: 371, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:53:02,407 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.27 vs. limit=15.0 2023-06-29 00:53:11,231 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-29 00:53:22,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2185566.0, ans=0.0 2023-06-29 00:53:25,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2185566.0, ans=0.0 2023-06-29 00:53:28,167 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.60 vs. limit=12.0 2023-06-29 00:53:53,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2185626.0, ans=0.0 2023-06-29 00:54:28,453 INFO [train.py:996] (2/4) Epoch 12, batch 28850, loss[loss=0.2322, simple_loss=0.304, pruned_loss=0.08015, over 21831.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2944, pruned_loss=0.0715, over 4278237.22 frames. ], batch size: 124, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:54:34,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2185746.0, ans=0.0 2023-06-29 00:54:39,074 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-29 00:55:07,398 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-29 00:55:18,115 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.464e+02 8.066e+02 1.240e+03 1.993e+03 4.428e+03, threshold=2.479e+03, percent-clipped=20.0 2023-06-29 00:55:35,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2185926.0, ans=0.125 2023-06-29 00:56:02,301 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. 
limit=6.0 2023-06-29 00:56:11,323 INFO [train.py:996] (2/4) Epoch 12, batch 28900, loss[loss=0.2285, simple_loss=0.3063, pruned_loss=0.07533, over 21775.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2974, pruned_loss=0.07287, over 4275085.98 frames. ], batch size: 298, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:56:18,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2186046.0, ans=0.125 2023-06-29 00:57:09,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2186166.0, ans=0.0 2023-06-29 00:57:21,397 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=15.0 2023-06-29 00:57:53,921 INFO [train.py:996] (2/4) Epoch 12, batch 28950, loss[loss=0.2236, simple_loss=0.3282, pruned_loss=0.05948, over 20769.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2965, pruned_loss=0.07172, over 4278434.61 frames. ], batch size: 608, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:58:17,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2186406.0, ans=0.125 2023-06-29 00:58:28,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2186406.0, ans=0.2 2023-06-29 00:58:28,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2186406.0, ans=0.0 2023-06-29 00:58:47,799 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.559e+02 9.001e+02 1.307e+03 1.896e+03 3.907e+03, threshold=2.614e+03, percent-clipped=14.0 2023-06-29 00:59:02,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=2186526.0, ans=6.0 2023-06-29 00:59:40,848 INFO [train.py:996] (2/4) Epoch 12, batch 29000, loss[loss=0.247, simple_loss=0.3273, pruned_loss=0.08335, over 21287.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3006, pruned_loss=0.07096, over 4277253.99 frames. ], batch size: 143, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:59:56,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2186706.0, ans=0.1 2023-06-29 01:00:27,806 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=12.0 2023-06-29 01:00:27,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=2186766.0, ans=15.0 2023-06-29 01:00:34,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2186766.0, ans=0.0 2023-06-29 01:00:36,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2186766.0, ans=0.2 2023-06-29 01:00:53,519 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.40 vs. limit=15.0 2023-06-29 01:01:21,626 INFO [train.py:996] (2/4) Epoch 12, batch 29050, loss[loss=0.2079, simple_loss=0.2846, pruned_loss=0.06554, over 21441.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2992, pruned_loss=0.0721, over 4275926.43 frames. 
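The Whitening entries compare a per-module metric of how far a group of activation channels is from being "white" (decorrelated, equally scaled) against a limit; values near 1.0 mean the channel covariance has nearly equal eigenvalues. The sketch below is one plausible way to compute such a metric and is an assumption for illustration, not the formula used in scaling.py:

    import torch

    def whitening_metric(x: torch.Tensor) -> float:
        # x: (num_frames, num_channels) activations for one whitening group.
        # Returns mean(eig^2) / mean(eig)^2 for the channel covariance:
        # exactly 1.0 when all eigenvalues are equal (fully whitened),
        # larger as the covariance becomes ill-conditioned.
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.t() @ x) / x.shape[0]                   # (C, C) covariance
        num_channels = cov.shape[0]
        mean_eig = torch.diagonal(cov).mean()            # trace / C
        mean_eig_sq = (cov * cov).sum() / num_channels   # trace(cov @ cov) / C
        return (mean_eig_sq / (mean_eig ** 2 + 1e-20)).item()

    # e.g. whitening_metric(torch.randn(1000, 256)) comes out close to 1.0, while
    # strongly correlated channels push it well above the limits logged here.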
], batch size: 131, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:01:55,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2187006.0, ans=0.1 2023-06-29 01:02:11,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2187066.0, ans=0.125 2023-06-29 01:02:14,139 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.866e+02 7.703e+02 1.025e+03 1.554e+03 4.084e+03, threshold=2.051e+03, percent-clipped=7.0 2023-06-29 01:02:24,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2187126.0, ans=0.1 2023-06-29 01:02:33,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2187126.0, ans=0.09899494936611666 2023-06-29 01:03:02,209 INFO [train.py:996] (2/4) Epoch 12, batch 29100, loss[loss=0.1767, simple_loss=0.244, pruned_loss=0.05469, over 21623.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2915, pruned_loss=0.07023, over 4272202.78 frames. ], batch size: 298, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:03:05,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2187246.0, ans=0.125 2023-06-29 01:03:10,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2187246.0, ans=0.1 2023-06-29 01:03:24,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2187306.0, ans=0.1 2023-06-29 01:03:28,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2187306.0, ans=0.2 2023-06-29 01:04:16,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2187486.0, ans=0.015 2023-06-29 01:04:38,539 INFO [train.py:996] (2/4) Epoch 12, batch 29150, loss[loss=0.2424, simple_loss=0.3398, pruned_loss=0.07249, over 21622.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2911, pruned_loss=0.069, over 4276327.51 frames. ], batch size: 389, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:05:26,669 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-29 01:05:30,928 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.044e+02 8.799e+02 1.298e+03 1.831e+03 4.569e+03, threshold=2.596e+03, percent-clipped=20.0 2023-06-29 01:05:41,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2187726.0, ans=0.1 2023-06-29 01:05:54,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2187726.0, ans=0.125 2023-06-29 01:06:18,430 INFO [train.py:996] (2/4) Epoch 12, batch 29200, loss[loss=0.1915, simple_loss=0.2711, pruned_loss=0.05595, over 20030.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2863, pruned_loss=0.06774, over 4276547.18 frames. 
], batch size: 702, lr: 2.37e-03, grad_scale: 32.0 2023-06-29 01:06:46,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2187906.0, ans=0.125 2023-06-29 01:06:51,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2187906.0, ans=0.0 2023-06-29 01:07:10,095 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=15.0 2023-06-29 01:07:20,017 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.68 vs. limit=8.0 2023-06-29 01:07:54,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2188086.0, ans=0.0 2023-06-29 01:08:03,705 INFO [train.py:996] (2/4) Epoch 12, batch 29250, loss[loss=0.2079, simple_loss=0.3075, pruned_loss=0.0542, over 21751.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2863, pruned_loss=0.06554, over 4271048.69 frames. ], batch size: 352, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:08:17,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2188146.0, ans=0.0 2023-06-29 01:08:23,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2188206.0, ans=0.125 2023-06-29 01:08:44,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2188266.0, ans=10.0 2023-06-29 01:08:53,908 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.020e+02 6.977e+02 9.878e+02 1.357e+03 4.006e+03, threshold=1.976e+03, percent-clipped=3.0 2023-06-29 01:09:30,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2188386.0, ans=0.125 2023-06-29 01:09:39,902 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.14 vs. limit=15.0 2023-06-29 01:09:43,867 INFO [train.py:996] (2/4) Epoch 12, batch 29300, loss[loss=0.194, simple_loss=0.2676, pruned_loss=0.06024, over 21593.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2883, pruned_loss=0.06492, over 4274615.30 frames. ], batch size: 263, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:09:55,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2188446.0, ans=0.125 2023-06-29 01:10:23,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2188506.0, ans=0.125 2023-06-29 01:11:30,175 INFO [train.py:996] (2/4) Epoch 12, batch 29350, loss[loss=0.2152, simple_loss=0.3172, pruned_loss=0.05658, over 21847.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2835, pruned_loss=0.06443, over 4279083.30 frames. 
], batch size: 372, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:12:09,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2188866.0, ans=0.0 2023-06-29 01:12:16,937 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.695e+02 7.385e+02 1.115e+03 1.625e+03 3.431e+03, threshold=2.230e+03, percent-clipped=15.0 2023-06-29 01:12:44,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2188926.0, ans=0.125 2023-06-29 01:13:07,858 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-29 01:13:11,566 INFO [train.py:996] (2/4) Epoch 12, batch 29400, loss[loss=0.1845, simple_loss=0.2753, pruned_loss=0.04684, over 21698.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2821, pruned_loss=0.06199, over 4262639.90 frames. ], batch size: 332, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:13:22,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2189046.0, ans=0.125 2023-06-29 01:13:36,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2189106.0, ans=0.125 2023-06-29 01:13:41,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2189106.0, ans=0.1 2023-06-29 01:13:50,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2189166.0, ans=0.125 2023-06-29 01:14:43,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2189286.0, ans=10.0 2023-06-29 01:14:52,517 INFO [train.py:996] (2/4) Epoch 12, batch 29450, loss[loss=0.2244, simple_loss=0.3061, pruned_loss=0.07138, over 21725.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.28, pruned_loss=0.06166, over 4257505.86 frames. ], batch size: 332, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:15:15,046 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.00 vs. limit=6.0 2023-06-29 01:15:44,175 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.913e+02 9.244e+02 1.482e+03 2.285e+03 4.603e+03, threshold=2.964e+03, percent-clipped=27.0 2023-06-29 01:16:21,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2189586.0, ans=0.0 2023-06-29 01:16:38,752 INFO [train.py:996] (2/4) Epoch 12, batch 29500, loss[loss=0.2471, simple_loss=0.3074, pruned_loss=0.09338, over 21627.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2849, pruned_loss=0.06501, over 4260740.08 frames. ], batch size: 471, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:16:43,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2189646.0, ans=0.0 2023-06-29 01:16:48,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2189646.0, ans=0.125 2023-06-29 01:16:57,291 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.76 vs. 
limit=15.0 2023-06-29 01:17:12,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2189766.0, ans=0.125 2023-06-29 01:17:24,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2189766.0, ans=0.125 2023-06-29 01:18:18,330 INFO [train.py:996] (2/4) Epoch 12, batch 29550, loss[loss=0.2141, simple_loss=0.28, pruned_loss=0.07413, over 21637.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2854, pruned_loss=0.06671, over 4275386.94 frames. ], batch size: 548, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:18:23,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=2189946.0, ans=0.025 2023-06-29 01:18:36,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2190006.0, ans=0.2 2023-06-29 01:18:37,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2190006.0, ans=0.125 2023-06-29 01:18:41,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2190006.0, ans=0.125 2023-06-29 01:18:44,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2190006.0, ans=0.125 2023-06-29 01:19:05,172 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.716e+02 8.394e+02 1.189e+03 1.876e+03 3.636e+03, threshold=2.379e+03, percent-clipped=6.0 2023-06-29 01:19:21,409 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.37 vs. limit=15.0 2023-06-29 01:19:31,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2190126.0, ans=0.125 2023-06-29 01:19:35,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2190126.0, ans=0.125 2023-06-29 01:20:00,974 INFO [train.py:996] (2/4) Epoch 12, batch 29600, loss[loss=0.2356, simple_loss=0.3168, pruned_loss=0.07719, over 21293.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2929, pruned_loss=0.06936, over 4275385.15 frames. ], batch size: 176, lr: 2.37e-03, grad_scale: 32.0 2023-06-29 01:20:08,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2190246.0, ans=0.0 2023-06-29 01:20:14,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2190246.0, ans=0.125 2023-06-29 01:20:17,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2190306.0, ans=0.1 2023-06-29 01:20:39,615 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-29 01:21:41,169 INFO [train.py:996] (2/4) Epoch 12, batch 29650, loss[loss=0.1729, simple_loss=0.2441, pruned_loss=0.05089, over 21241.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2907, pruned_loss=0.06669, over 4277591.93 frames. 
], batch size: 159, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:21:46,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2190546.0, ans=0.125 2023-06-29 01:22:12,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2190606.0, ans=0.2 2023-06-29 01:22:14,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2190606.0, ans=0.0 2023-06-29 01:22:33,826 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.536e+02 9.770e+02 1.872e+03 2.859e+03 6.209e+03, threshold=3.743e+03, percent-clipped=35.0 2023-06-29 01:23:05,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2190786.0, ans=0.125 2023-06-29 01:23:22,769 INFO [train.py:996] (2/4) Epoch 12, batch 29700, loss[loss=0.2373, simple_loss=0.347, pruned_loss=0.06381, over 21797.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2912, pruned_loss=0.06613, over 4277422.90 frames. ], batch size: 282, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:23:34,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2190846.0, ans=0.0 2023-06-29 01:24:01,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2190906.0, ans=0.0 2023-06-29 01:24:09,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2190966.0, ans=0.1 2023-06-29 01:25:02,359 INFO [train.py:996] (2/4) Epoch 12, batch 29750, loss[loss=0.2316, simple_loss=0.3298, pruned_loss=0.06668, over 21710.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2964, pruned_loss=0.0658, over 4281102.23 frames. ], batch size: 389, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:25:05,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=2191146.0, ans=10.0 2023-06-29 01:25:29,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2191206.0, ans=0.0 2023-06-29 01:25:58,259 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.524e+02 7.715e+02 1.077e+03 1.535e+03 3.860e+03, threshold=2.154e+03, percent-clipped=1.0 2023-06-29 01:26:42,165 INFO [train.py:996] (2/4) Epoch 12, batch 29800, loss[loss=0.216, simple_loss=0.2907, pruned_loss=0.07068, over 21887.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2972, pruned_loss=0.0664, over 4286531.48 frames. ], batch size: 332, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:27:10,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2191506.0, ans=0.0 2023-06-29 01:27:14,127 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=12.0 2023-06-29 01:27:15,572 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.72 vs. 
limit=10.0 2023-06-29 01:27:41,883 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-29 01:28:07,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2191686.0, ans=0.125 2023-06-29 01:28:20,979 INFO [train.py:996] (2/4) Epoch 12, batch 29850, loss[loss=0.1941, simple_loss=0.2683, pruned_loss=0.05993, over 21822.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2925, pruned_loss=0.06439, over 4292725.31 frames. ], batch size: 282, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:28:29,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2191746.0, ans=0.0 2023-06-29 01:28:45,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2191806.0, ans=0.0 2023-06-29 01:29:16,876 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.021e+02 7.804e+02 1.039e+03 1.669e+03 3.761e+03, threshold=2.078e+03, percent-clipped=15.0 2023-06-29 01:29:45,678 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.88 vs. limit=15.0 2023-06-29 01:30:00,610 INFO [train.py:996] (2/4) Epoch 12, batch 29900, loss[loss=0.2205, simple_loss=0.3336, pruned_loss=0.05375, over 19798.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2917, pruned_loss=0.06549, over 4296369.07 frames. ], batch size: 703, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:30:17,044 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-29 01:31:22,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2192226.0, ans=0.0 2023-06-29 01:31:45,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2192346.0, ans=0.07 2023-06-29 01:31:46,204 INFO [train.py:996] (2/4) Epoch 12, batch 29950, loss[loss=0.2416, simple_loss=0.3093, pruned_loss=0.08697, over 21340.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2959, pruned_loss=0.06864, over 4291888.81 frames. ], batch size: 548, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:32:21,247 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=22.5 2023-06-29 01:32:38,614 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.758e+02 9.924e+02 1.385e+03 1.832e+03 3.568e+03, threshold=2.770e+03, percent-clipped=22.0 2023-06-29 01:33:32,982 INFO [train.py:996] (2/4) Epoch 12, batch 30000, loss[loss=0.1951, simple_loss=0.2883, pruned_loss=0.051, over 21723.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2986, pruned_loss=0.06907, over 4295739.36 frames. ], batch size: 247, lr: 2.37e-03, grad_scale: 32.0 2023-06-29 01:33:32,982 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-29 01:33:51,788 INFO [train.py:1028] (2/4) Epoch 12, validation: loss=0.255, simple_loss=0.3458, pruned_loss=0.08216, over 1796401.00 frames. 
2023-06-29 01:33:51,789 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 23793MB 2023-06-29 01:34:38,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2192766.0, ans=0.1 2023-06-29 01:35:33,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2192886.0, ans=0.0 2023-06-29 01:35:35,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2192886.0, ans=0.0 2023-06-29 01:35:38,167 INFO [train.py:996] (2/4) Epoch 12, batch 30050, loss[loss=0.3171, simple_loss=0.4126, pruned_loss=0.1108, over 21468.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.3005, pruned_loss=0.0666, over 4282142.28 frames. ], batch size: 507, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:35:51,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=2192946.0, ans=0.025 2023-06-29 01:36:01,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2193006.0, ans=0.125 2023-06-29 01:36:36,203 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.629e+02 9.060e+02 1.265e+03 2.367e+03 5.681e+03, threshold=2.530e+03, percent-clipped=16.0 2023-06-29 01:36:41,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2193126.0, ans=0.125 2023-06-29 01:36:53,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2193126.0, ans=0.0 2023-06-29 01:37:17,751 INFO [train.py:996] (2/4) Epoch 12, batch 30100, loss[loss=0.2194, simple_loss=0.284, pruned_loss=0.0774, over 21340.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2992, pruned_loss=0.06576, over 4277048.62 frames. ], batch size: 131, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:38:03,522 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=12.0 2023-06-29 01:38:36,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2193426.0, ans=0.125 2023-06-29 01:38:36,839 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. limit=6.0 2023-06-29 01:38:38,228 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.25 vs. limit=6.0 2023-06-29 01:38:38,306 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-29 01:39:04,270 INFO [train.py:996] (2/4) Epoch 12, batch 30150, loss[loss=0.2294, simple_loss=0.3074, pruned_loss=0.07568, over 21306.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2951, pruned_loss=0.06707, over 4274808.38 frames. 
], batch size: 176, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:39:43,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2193606.0, ans=0.0 2023-06-29 01:40:05,027 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.363e+02 8.482e+02 1.272e+03 2.081e+03 3.656e+03, threshold=2.544e+03, percent-clipped=13.0 2023-06-29 01:40:47,524 INFO [train.py:996] (2/4) Epoch 12, batch 30200, loss[loss=0.2189, simple_loss=0.3082, pruned_loss=0.06478, over 21140.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2976, pruned_loss=0.06575, over 4281242.63 frames. ], batch size: 143, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:41:36,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2193966.0, ans=0.0 2023-06-29 01:41:51,600 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.91 vs. limit=6.0 2023-06-29 01:42:00,761 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-29 01:42:02,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2194026.0, ans=0.125 2023-06-29 01:42:07,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2194026.0, ans=0.07 2023-06-29 01:42:33,445 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.16 vs. limit=15.0 2023-06-29 01:42:37,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=2194146.0, ans=0.5 2023-06-29 01:42:38,876 INFO [train.py:996] (2/4) Epoch 12, batch 30250, loss[loss=0.2462, simple_loss=0.3394, pruned_loss=0.07649, over 20012.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.3051, pruned_loss=0.06781, over 4275974.76 frames. ], batch size: 702, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:43:23,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2194266.0, ans=0.1 2023-06-29 01:43:33,014 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.232e+02 7.981e+02 1.163e+03 1.576e+03 2.909e+03, threshold=2.325e+03, percent-clipped=5.0 2023-06-29 01:43:33,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2194266.0, ans=0.0 2023-06-29 01:44:04,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2194386.0, ans=0.09899494936611666 2023-06-29 01:44:21,067 INFO [train.py:996] (2/4) Epoch 12, batch 30300, loss[loss=0.1809, simple_loss=0.2584, pruned_loss=0.05174, over 21930.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.3011, pruned_loss=0.0676, over 4270473.27 frames. 
], batch size: 113, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:44:50,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=2194506.0, ans=10.0 2023-06-29 01:44:51,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2194506.0, ans=0.125 2023-06-29 01:44:57,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2194506.0, ans=0.09899494936611666 2023-06-29 01:45:53,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2194686.0, ans=0.125 2023-06-29 01:45:55,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2194686.0, ans=0.125 2023-06-29 01:46:09,183 INFO [train.py:996] (2/4) Epoch 12, batch 30350, loss[loss=0.2233, simple_loss=0.2981, pruned_loss=0.07426, over 21370.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.3, pruned_loss=0.06869, over 4270357.83 frames. ], batch size: 194, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:46:26,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2194806.0, ans=0.0 2023-06-29 01:46:41,807 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.28 vs. limit=15.0 2023-06-29 01:46:49,129 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.687e+02 9.388e+02 1.588e+03 2.178e+03 4.101e+03, threshold=3.176e+03, percent-clipped=21.0 2023-06-29 01:46:49,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2194866.0, ans=0.125 2023-06-29 01:47:04,394 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=22.5 2023-06-29 01:47:26,698 INFO [train.py:996] (2/4) Epoch 12, batch 30400, loss[loss=0.1946, simple_loss=0.2463, pruned_loss=0.07147, over 20237.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2948, pruned_loss=0.06773, over 4264537.91 frames. ], batch size: 703, lr: 2.37e-03, grad_scale: 32.0 2023-06-29 01:47:43,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2195106.0, ans=0.1 2023-06-29 01:48:18,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2195226.0, ans=0.125 2023-06-29 01:48:25,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2195226.0, ans=0.0 2023-06-29 01:48:50,488 INFO [train.py:996] (2/4) Epoch 12, batch 30450, loss[loss=0.2516, simple_loss=0.3745, pruned_loss=0.06439, over 19781.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2957, pruned_loss=0.06744, over 4204775.34 frames. 
], batch size: 702, lr: 2.37e-03, grad_scale: 8.0 2023-06-29 01:49:17,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2195406.0, ans=0.125 2023-06-29 01:49:34,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2195466.0, ans=0.0 2023-06-29 01:49:38,086 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.733e+02 1.475e+03 2.498e+03 5.657e+03 1.532e+04, threshold=4.997e+03, percent-clipped=41.0 2023-06-29 01:49:49,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2195526.0, ans=0.125 2023-06-29 01:49:57,824 INFO [train.py:1249] (2/4) Done!